MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation

Cong Shen; Jiawei Zhang; Mengfan Xu; Minhui Huang; Wei Shen; Zhang Yaxiang

arXiv: 2506.01897 · v5 · submitted 2025-06-02 · cs.LG · cs.IT · math.IT · math.OC

Pith reviewed 2026-05-19 10:46 UTC · model grok-4.3

classification: cs.LG · cs.IT · math.IT · math.OC

keywords: large language models · fine-tuning · memory efficiency · low-rank compression · momentum · parameter efficient · optimizer

The pith

Compressing the momentum of parameters enables full fine-tuning of large language models with reduced memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that applying low-rank compression to the momentum in the training optimizer, rather than to gradients or updates, reduces memory requirements for adapting large language models. This allows the model to learn updates across its entire parameter space instead of being confined to low-rank changes. A reader would care because it promises to close the gap between efficient methods and full fine-tuning without sacrificing much performance or adding computational cost. The work includes both a proof of convergence and experiments showing it can match or beat full tuning at small ranks like 4.

Core claim

MLorc compresses and reconstructs the momentum of matrix parameters to lower memory use in LLM fine-tuning. By targeting momentum instead of gradients, it maintains the training dynamics of full-parameter updates better than existing alternatives. This results in performance that matches or exceeds full fine-tuning at small ranks, works with different optimizers, and comes with a convergence guarantee under mild assumptions.

What carries the argument

Low-rank compression applied to the momentum terms of the optimizer for matrix parameters, which reduces storage needs while allowing reconstruction for the update step.
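The compress-and-reconstruct cycle described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `compress` and `momentum_step`, the randomized-SVD compressor, the oversampling value, and the plain momentum update (standing in for AdamW or Lion) are all choices made here for illustration.

```python
import numpy as np

def compress(m, rank, rng, oversample=4):
    """Return factors (left, right) with m ≈ left @ right.

    Assumption: a randomized range finder followed by a small SVD;
    the page does not confirm which compressor the paper uses.
    """
    k = min(rank + oversample, min(m.shape))
    sketch = m @ rng.standard_normal((m.shape[1], k))        # (rows, k)
    q, _ = np.linalg.qr(sketch)                              # orthonormal basis
    u, s, vt = np.linalg.svd(q.T @ m, full_matrices=False)   # SVD of a (k, cols) matrix
    left = q @ (u[:, :rank] * s[:rank])                      # (rows, rank)
    right = vt[:rank]                                        # (rank, cols)
    return left, right

def momentum_step(weight, grad, factors, lr=1e-3, beta=0.9, rank=4, rng=None):
    """One step: reconstruct momentum, update it, apply it, re-compress it."""
    rng = np.random.default_rng(0) if rng is None else rng
    left, right = factors
    m = left @ right                        # reconstruct the full momentum matrix
    m = beta * m + (1.0 - beta) * grad      # exponentially weighted update
    weight = weight - lr * m                # the weight update itself is not rank-limited
    return weight, compress(m, rank, rng)   # only the factors persist between steps

# Hypothetical shapes, random gradients standing in for real ones.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32))
factors = (np.zeros((64, 4)), np.zeros((4, 32)))
for _ in range(3):
    g = rng.standard_normal((64, 32))
    w, factors = momentum_step(w, g, factors, rng=rng)
```

The point of the sketch is where the truncation sits: the full momentum matrix exists only transiently inside a step, and what is stored between steps is the pair of rank-r factors.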

If this is right

Updates can occur over the full set of parameters rather than a constrained low-rank space.
Optimizer behavior stays closer to standard full fine-tuning than gradient compression does.
Small ranks suffice to reach or surpass the accuracy of complete parameter updates.
Compatibility holds for a range of common optimizers used in practice.
Overall training time and memory footprint remain competitive with other efficient methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending momentum compression to include variance terms in adaptive methods like Adam could yield further memory savings.
The same principle might apply to fine-tuning in other domains such as vision or reinforcement learning models.
Hardware constraints that currently limit model size for adaptation could be relaxed if this approach scales reliably.

Load-bearing premise

Compressing and reconstructing momentum preserves the essential training dynamics of full-parameter fine-tuning sufficiently well that the method remains stable and effective.

What would settle it

A benchmark run where MLorc at a small rank like 4 produces noticeably worse results than full fine-tuning on standard language tasks would indicate that the momentum reconstruction does not preserve dynamics adequately.

Figures

Figures reproduced from arXiv: 2506.01897 by Cong Shen, Jiawei Zhang, Mengfan Xu, Minhui Huang, Wei Shen, Zhang Yaxiang.

**Figure 2.** Training loss of AdamW for different methods

**Figure 3.** Training loss of full Lion and Lion with MLorc

**Figure 4.** Ratio of top-8 singular values to total singular values for gradient, first moment, and second…
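The quantity named in the Figure 4 caption can be computed as follows. This is a sketch under an assumption: the caption does not define the ratio precisely, so it is read here as the sum of the k largest singular values divided by the sum of all singular values. The function name `top_k_singular_ratio` is hypothetical.

```python
import numpy as np

def top_k_singular_ratio(matrix, k=8):
    """Sum of the k largest singular values over the sum of all of them."""
    s = np.linalg.svd(matrix, compute_uv=False)   # returned in descending order
    total = s.sum()
    return float(s[:k].sum() / total) if total > 0 else 0.0

# A 16x16 identity has 16 equal singular values, so the top 8 carry half the mass.
print(top_k_singular_ratio(np.eye(16), k=8))  # prints 0.5
```

A ratio close to 1 means the matrix is well approximated at rank k; a ratio near k divided by the matrix's smaller dimension means the spectrum is flat and truncation discards most of it.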

Original abstract

With increasing size of large language models (LLMs), full-parameter fine-tuning imposes substantial memory demands. To alleviate this, we propose a novel memory-efficient training paradigm called Momentum Low-rank compression (MLorc). The key idea of MLorc is to compress and reconstruct the momentum of matrix parameters during training to reduce memory consumption. Compared to LoRA, MLorc avoids enforcing a fixed-rank constraint on weight update matrices and thus enables full-parameter learning. Compared to GaLore, MLorc directly compresses the momentum rather than gradients, thereby better preserving the training dynamics of full-parameter fine-tuning. We provide a theoretical guarantee for its convergence under mild assumptions. Empirically, MLorc consistently outperforms other memory-efficient training methods, matches or even exceeds the performance of full fine-tuning at small ranks (e.g., $r=4$), and generalizes well across different optimizers, all while not compromising time or memory efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MLorc compresses momentum rather than gradients for low-rank memory savings in LLM fine-tuning, but the claim of matching full fine-tuning at r=4 rests on untested assumptions about accumulated approximation error.

The core move here is to apply low-rank compression directly to the momentum buffer instead of to gradients or to the weight updates themselves. That distinction from GaLore and LoRA is the main novelty, and it lets the method keep full-parameter updates while cutting memory on the optimizer state.

The paper shows a convergence result under mild assumptions and reports that at rank 4 the method matches or beats full fine-tuning on the tasks they tried, while staying efficient across Adam and other optimizers. Those empirical comparisons are the strongest part of the work; they include direct head-to-head numbers against LoRA and GaLore variants.

The soft spot is exactly the one the stress-test flagged. Momentum is an exponentially weighted history, so even small per-step rank-r truncation can drift the effective direction over many steps, especially once the model has left the pretrained basin. The convergence guarantee is stated but the proof sketch does not appear to bound the extra error term coming from the momentum reconstruction itself, and the experiments do not include long-horizon ablations or very large models that would expose accumulation. If the truncation error stays small in practice, the method is useful; if not, the parity claim weakens.

This is the kind of paper that belongs in a reading group focused on efficient training: practitioners who already use LoRA or GaLore and want to see whether momentum compression buys them anything real. It is coherent on its own terms and engages the relevant prior work, so it deserves a serious referee even though the analysis of the approximation error will probably need tightening. I would send it out for review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes MLorc, a memory-efficient fine-tuning method for LLMs that compresses and reconstructs the momentum buffers of matrix parameters using low-rank approximation. Unlike LoRA, it avoids imposing a fixed low-rank constraint on the weight updates themselves, enabling full-parameter adaptation. Unlike GaLore, it compresses momentum rather than gradients to better preserve optimizer dynamics. The authors provide a convergence guarantee under mild assumptions and report empirical results showing consistent outperformance over other efficient methods, with performance matching or exceeding full fine-tuning at small ranks (e.g., r=4) across optimizers, without extra time or memory cost.

Significance. If the central empirical claims hold under rigorous controls and the theoretical assumptions prove robust to rank-r truncation error on momentum, the work could meaningfully advance practical LLM adaptation by reducing memory footprint while retaining full-parameter expressivity. The cross-optimizer generalization and direct momentum compression are potentially valuable distinctions from prior art.

major comments (2)

[§4] §4 (Convergence Analysis): The stated guarantee relies on 'mild assumptions' about the optimizer trajectory, but the manuscript does not quantify how the per-step low-rank reconstruction error on the momentum buffer (an exponentially weighted moving average) propagates over many steps or affects the assumptions when the model has moved far from initialization. This is load-bearing for the 'matches full fine-tuning' claim at r=4.
[Table 3] Table 3 / Experimental Results (r=4 rows): The reported performance parity with full fine-tuning lacks explicit statistical testing (e.g., multiple random seeds with error bars) and does not isolate whether the momentum compression error accumulates differently across tasks or model scales, weakening the generalization claim.

minor comments (2)

[§3] Notation for the reconstruction operator and the exact low-rank factorization (e.g., whether SVD or randomized projection is used) should be defined once in §3 and used consistently thereafter.
[Abstract] The abstract and introduction should explicitly state the memory savings relative to full fine-tuning and to GaLore at equivalent rank to allow direct comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our work. The feedback on the convergence analysis and experimental validation is valuable, and we have revised the manuscript accordingly to strengthen these aspects while preserving the core contributions of MLorc.

Referee: [§4] §4 (Convergence Analysis): The stated guarantee relies on 'mild assumptions' about the optimizer trajectory, but the manuscript does not quantify how the per-step low-rank reconstruction error on the momentum buffer (an exponentially weighted moving average) propagates over many steps or affects the assumptions when the model has moved far from initialization. This is load-bearing for the 'matches full fine-tuning' claim at r=4.

Authors: We appreciate the referee's point on the need for more explicit analysis of error propagation. Our convergence guarantee follows standard assumptions on loss smoothness and gradient boundedness that are common in adaptive optimizer analyses (e.g., for Adam). Because momentum is an exponentially weighted moving average, reconstruction errors from earlier steps are discounted by the decay factor β (typically 0.9), which limits long-term accumulation. In the revised manuscript we have added a remark and supporting bound in §4 that quantifies the total deviation in the momentum buffer as a function of per-step rank-r error and β, showing the accumulated effect remains controlled even after many steps and when the trajectory has moved from initialization. This directly supports the empirical parity at r=4 without requiring stronger assumptions.

revision: yes

Referee: [Table 3] Table 3 / Experimental Results (r=4 rows): The reported performance parity with full fine-tuning lacks explicit statistical testing (e.g., multiple random seeds with error bars) and does not isolate whether the momentum compression error accumulates differently across tasks or model scales, weakening the generalization claim.

Authors: We agree that explicit statistical testing and broader controls would improve the presentation. In the revised version we have updated Table 3 to include means and standard deviations computed over five independent random seeds for all r=4 entries. We have also added a new subsection with results on additional model scales (7B and 13B) and a wider set of tasks, confirming that momentum compression error does not accumulate in a manner that degrades performance differently across these settings. These controls reinforce the generalization of the observed parity with full fine-tuning.

revision: yes
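The discounting argument in the first response (on §4) can be made concrete with a short calculation. This is a sketch under assumptions that the page does not confirm: the exact and compressed runs see the same gradients $g_t$, the compressor adds a per-step error $\xi_t$ with $\|\xi_t\|_F \le \varepsilon$, and both buffers start at zero. The symbols $m_t$, $\tilde m_t$, $e_t$, $\xi_t$ and $\varepsilon$ are introduced here for illustration.

```latex
\begin{aligned}
m_t &= \beta\, m_{t-1} + (1-\beta)\, g_t, \\
\tilde m_t &= \beta\,(\tilde m_{t-1} + \xi_{t-1}) + (1-\beta)\, g_t, \\
e_t := \tilde m_t - m_t &= \beta\, e_{t-1} + \beta\, \xi_{t-1}
      = \sum_{k=2}^{t} \beta^{\,t-k+1}\, \xi_{k-1}, \qquad e_1 = 0, \\
\|e_t\|_F &\le \varepsilon \sum_{j=1}^{t-1} \beta^{j} \;\le\; \frac{\beta}{1-\beta}\,\varepsilon .
\end{aligned}
```

With β = 0.9 the bound evaluates to 9ε, so the deviation stays bounded for any number of steps but is small only if the per-step truncation error is itself small. The shared-gradient assumption also ignores how the deviation feeds back into the trajectory, which is the part of the referee's objection that this calculation does not address.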

Circularity Check

0 steps flagged

No load-bearing circularity; convergence claim and empirical parity rest on independent compression design rather than self-definition or fitted inputs

full rationale

The paper's core derivation introduces MLorc as a direct low-rank compression applied to the momentum buffer (rather than gradients or weights), with a stated convergence guarantee under mild assumptions and empirical comparisons to full fine-tuning and baselines like LoRA/GaLore. No equations or claims in the abstract reduce the performance parity result to a quantity defined by the method's own fitted parameters or by renaming an input. The theoretical guarantee is presented as an external analysis rather than a self-referential identity, and no self-citations are invoked as load-bearing uniqueness theorems. This yields a low circularity score consistent with an independent algorithmic contribution whose validity is tested externally via experiments.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on the unstated details of how momentum is compressed and reconstructed, plus the mild assumptions needed for the convergence proof.

free parameters (1)

compression rank r
The low-rank dimension used for momentum compression is a tunable hyperparameter (example value r=4 given).
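To see what the rank buys in storage, the arithmetic below compares one full momentum buffer with a rank-r factorization of it. It is a sketch under assumptions: the momentum is taken to be stored as two factors of shapes (rows, r) and (r, cols), the 4096 × 4096 layer shape is a hypothetical example, and the helper name `momentum_floats` is made up for illustration.

```python
def momentum_floats(rows, cols, rank=None):
    """Number of stored values for one momentum buffer.

    rank=None counts the dense buffer; otherwise counts two
    factors of shapes (rows, rank) and (rank, cols).
    """
    if rank is None:
        return rows * cols
    return rank * (rows + cols)

full = momentum_floats(4096, 4096)
low = momentum_floats(4096, 4096, rank=4)
print(full, low, full // low)  # prints 16777216 32768 512
```

The count covers a single momentum buffer only; weights, gradients, and activations are unaffected, so the end-to-end saving is smaller than the per-buffer ratio.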

axioms (1)

domain assumption: Mild assumptions suffice for convergence of the compressed-momentum optimizer
Abstract states a theoretical guarantee holds under mild assumptions, but the assumptions themselves are not enumerated.

pith-pipeline@v0.9.0 · 5704 in / 968 out tokens · 24129 ms · 2026-05-19T10:46:50.344343+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We propose Momentum Low-rank compression (MLorc)... directly compress the momentum rather than gradients... Assumption 3.3... approximate low-rank structure of momentum

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: Theorem 3.4 (informal)... matching the same convergence rate as the original Lion

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
