LoRA vs. Full Fine-Tuning: A Theoretical Perspective

Ali Zindari; Rotem Mulayoff; Sebastian U. Stich

arxiv: 2605.19018 · v1 · pith:LCUR4CEWnew · submitted 2026-05-18 · 💻 cs.LG

Pith reviewed 2026-05-20 12:10 UTC · model grok-4.3

classification 💻 cs.LG

keywords LoRA · fine-tuning · excess risk · linear regression · low-rank adaptation · generalization · task difference

The pith

LoRA can achieve lower excess risk than full fine-tuning when the difference between pretraining and downstream tasks is low-rank.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares LoRA to full fine-tuning inside a linear regression model to identify when the cheaper method actually generalizes better. It shows that restricting the update to a low-rank form reduces excess risk precisely when the shift from pretraining to the new task itself has low rank. A reader would care because this supplies a concrete condition under which limiting expressivity helps rather than hurts, and it explains why very small ranks sometimes raise test accuracy even though they constrain what the model can represent. The analysis covers both overdetermined and underdetermined data regimes and is checked against practical tasks.

Core claim

In a linear regression setting, LoRA achieves lower excess risk than full fine-tuning in both overdetermined and underdetermined regimes when the difference between the pretraining parameters and the optimal downstream parameters is effectively low-rank. The theory further shows that the LoRA rank controls a bias-variance tradeoff, so that a very small rank can improve generalization by limiting expressivity even though it reduces the model's capacity to fit the downstream data.
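This page does not reproduce the paper's formulas, so the block below is only a sketch of the kind of setup the core claim refers to. The symbols $W_0$, $\Sigma$, and $r$, and the idealization of LoRA as a rank-constrained least-squares problem, are assumptions made here for illustration; $d_x$, $d_y$, $n$, $\sigma$, and $\Delta^\star$ follow the figure captions.

```latex
% Assumed notation, not taken verbatim from the paper.
% Noise: E[eps_i] = 0 with per-coordinate variance sigma^2.
\begin{align*}
  y_i &= (W_0 + \Delta^\star)^\top x_i + \varepsilon_i,
      \qquad x_i \in \mathbb{R}^{d_x},\; y_i \in \mathbb{R}^{d_y},\; i = 1,\dots,n, \\
  \hat\Delta_{\mathrm{FFT}} &\in \arg\min_{\Delta \in \mathbb{R}^{d_x \times d_y}}
      \sum_{i=1}^{n} \bigl\| y_i - (W_0 + \Delta)^\top x_i \bigr\|_2^2, \\
  \hat\Delta_{\mathrm{LoRA}} &\in \arg\min_{\operatorname{rank}(\Delta) \le r}
      \sum_{i=1}^{n} \bigl\| y_i - (W_0 + \Delta)^\top x_i \bigr\|_2^2, \\
  \mathcal{E}(\hat\Delta) &= \mathbb{E}_x \bigl\| (\hat\Delta - \Delta^\star)^\top x \bigr\|_2^2
      = \operatorname{tr}\!\bigl( (\hat\Delta - \Delta^\star)^\top \Sigma\, (\hat\Delta - \Delta^\star) \bigr),
      \qquad \Sigma = \mathbb{E}[x x^\top].
\end{align*}
```

Under this reading, the core claim is a comparison of $\mathcal{E}(\hat\Delta_{\mathrm{LoRA}})$ with $\mathcal{E}(\hat\Delta_{\mathrm{FFT}})$ when $\Delta^\star$ is effectively low-rank.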

What carries the argument

The low-rank parameterization of the parameter difference between pretraining and downstream tasks, used to derive explicit excess-risk bounds that are compared against the bounds for updating every weight.

If this is right

When the pretraining-to-downstream difference has low rank, LoRA with matching rank produces lower excess risk than updating all parameters.
A small LoRA rank functions as regularization and can raise test accuracy by preventing the model from fitting noise.
The identified advantage of LoRA holds in both overdetermined and underdetermined linear regression settings.
Experiments on practical tasks indicate that the same tradeoffs appear outside the linear case.
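The rank tradeoff in the statements above can be illustrated numerically. This is a minimal sketch, assuming isotropic Gaussian inputs, an overdetermined regime, and LoRA idealized as the global minimizer of the rank-constrained least-squares problem (computed by classical reduced-rank regression). The function name `lora_rank_sweep` and every default value are illustrative choices, not the paper's.

```python
import numpy as np

def lora_rank_sweep(n=90, d=30, true_rank=3, sigma=2.0, seed=0):
    """Excess risk of the rank-r solution for r = 1..d, with isotropic inputs."""
    rng = np.random.default_rng(seed)
    # Hypothetical ground truth: an exactly rank-`true_rank` task shift.
    delta_star = rng.standard_normal((d, true_rank)) @ rng.standard_normal((true_rank, d))
    X = rng.standard_normal((n, d))
    # Residual targets Y - X @ W0; W0 cancels, so it is taken as zero here.
    R = X @ delta_star + sigma * rng.standard_normal((n, d))
    # Unconstrained least squares, standing in for full fine-tuning.
    delta_ols, *_ = np.linalg.lstsq(X, R, rcond=None)
    # Reduced-rank regression: project the least-squares solution onto the
    # top right singular vectors of the fitted values X @ delta_ols.
    _, _, Vt = np.linalg.svd(X @ delta_ols, full_matrices=False)
    risks = []
    for r in range(1, d + 1):
        V = Vt[:r].T
        delta_r = delta_ols @ V @ V.T
        # With identity input covariance, excess risk is the squared Frobenius error.
        risks.append(float(np.sum((delta_r - delta_star) ** 2)))
    return np.array(risks)

# Average over a few seeds; r = d coincides with the unconstrained solution.
risks = np.mean([lora_rank_sweep(seed=s) for s in range(5)], axis=0)
best_rank = int(np.argmin(risks)) + 1
print(f"best rank: {best_rank}, risk at r=1: {risks[0]:.1f}, risk at r=d: {risks[-1]:.1f}")
```

In this toy setting, ranks below the true rank pay a bias for the discarded directions, while ranks above it keep more noise, which is the qualitative pattern the list describes.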

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Estimating the effective rank of a task difference before fine-tuning could guide the choice between LoRA and full updates.
The low-rank shift idea may motivate hybrid adapters that apply full updates only on directions outside the low-rank subspace.
Similar analysis could be tested on nonlinear networks by measuring the rank of activation differences or gradient updates between pretraining and downstream.
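As a rough illustration of the first extension, an effective rank can be read off the singular values of an estimated weight difference. The energy threshold of 0.99 and the name `effective_rank` are assumptions for this sketch; the paper may use a different notion of effective rank.

```python
import numpy as np

def effective_rank(matrix, energy=0.99):
    """Smallest k such that the top-k singular values hold `energy` of the
    squared spectrum. Assumes `matrix` is not identically zero."""
    s = np.linalg.svd(matrix, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(s))

# Hypothetical usage: `w_finetuned` and `w_pretrained` stand in for one layer's
# weights before and after a reference fine-tuning run.
rng = np.random.default_rng(0)
w_pretrained = rng.standard_normal((64, 64))
shift = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))
w_finetuned = w_pretrained + shift
print(effective_rank(w_finetuned - w_pretrained))
```

A small value relative to the layer dimensions would point toward the regime where the paper predicts LoRA is preferable; a value near the full dimension would point toward full updates.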

Load-bearing premise

The difference between pretraining and downstream tasks can be modeled as effectively low-rank.

What would settle it

In a controlled linear regression experiment where the optimal parameter shift is known to be low-rank, full fine-tuning shows lower test error than LoRA across multiple random seeds.
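One way such a controlled check could be run is sketched below. It assumes isotropic Gaussian inputs, a known pretrained matrix `W0`, and LoRA idealized as the rank-constrained least-squares solution with rank matched to the true shift. The helper `one_trial` and all sizes are hypothetical.

```python
import numpy as np

def one_trial(seed, n=90, d=30, rank=2, sigma=2.0):
    """Return (FFT excess risk, LoRA excess risk) for one random problem."""
    rng = np.random.default_rng(seed)
    W0 = rng.standard_normal((d, d))                      # pretrained weights
    delta_star = rng.standard_normal((d, rank)) @ rng.standard_normal((rank, d))
    X = rng.standard_normal((n, d))
    Y = X @ (W0 + delta_star) + sigma * rng.standard_normal((n, d))
    R = Y - X @ W0                                        # what the update must explain
    # Full fine-tuning: unconstrained least squares on the residual targets.
    delta_fft, *_ = np.linalg.lstsq(X, R, rcond=None)
    # Idealized LoRA: best rank-`rank` fit, via reduced-rank regression.
    _, _, Vt = np.linalg.svd(X @ delta_fft, full_matrices=False)
    V = Vt[:rank].T
    delta_lora = delta_fft @ V @ V.T
    err = lambda m: float(np.sum((m - delta_star) ** 2))  # isotropic excess risk
    return err(delta_fft), err(delta_lora)

results = np.array([one_trial(s) for s in range(20)])
fft_wins = int(np.sum(results[:, 0] < results[:, 1]))
print(f"mean FFT risk {results[:, 0].mean():.2f}, "
      f"mean LoRA risk {results[:, 1].mean():.2f}, FFT wins {fft_wins}/20")
```

If the claim holds in this regime, `fft_wins` should stay small; a consistently large count across seeds would be the kind of evidence described above.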

Figures

Figures reproduced from arXiv: 2605.19018 by Ali Zindari, Rotem Mulayoff, Sebastian U. Stich.

**Figure 1.** Linear regression experiments. Panel (a) presents the excess risk of FFT and LoRA under varying sample size n (decreasing from left to right), fixed dimensions dx = dy = 100, noise magnitude σ = 12, and true task-shift rank rank(∆⋆) = 4. Panel (b) plots the excess risk as a function of noise level σ ∈ [1, 100] with dx = dy = 100, n = 50, and rank(∆⋆) = 10. Results are averaged over 100 random seeds; shad…

**Figure 2.** Effect of label noise in LLMs fine-tuning. We fine-tuned Qwen2.5 models using LoRA with various ranks across different levels of label noise. For each configuration, we trained 3 times using random seeds. Panels (a) and (b) show the mean and the range of the results for the 0.5B model fine-tuned on BoolQ and CommonsenseQA, respectively. Here, in the strong-noise regime, LoRA outperforms FFT as predicted by…

**Figure 3.** Effect of sample size in LLMs fine-tuning. We fine-tuned Qwen2.5 models using LoRA with various ranks across different sample sizes. For each configuration, we trained 3 times using random seeds. The top of Panels (a) and (b) shows the mean and the range of the results for the 1.5B model fine-tuned on BoolQ and CommonsenseQA, respectively. For each task, we estimated ∆⋆ and computed its singular values per…

**Figure 4.** Effect of label noise in LLMs fine-tuning. We fine-tuned Qwen2.5 models using LoRA with various ranks across different levels of label noise. For each configuration, we trained 3 times using random seeds. Panels (a) and (b) show the mean and the range of the results for the 0.5B model fine-tuned on BoolQ and CommonsenseQA, respectively. Here, in the strong-noise regime, LoRA outperforms FFT as predicted by…

**Figure 5.** Effect of sample size in LLMs fine-tuning. We fine-tuned Qwen2.5 models using LoRA with various ranks across different sample sizes. For each configuration, we trained 3 times using random seeds. The top of Panels (a) and (b) shows the mean and the range of the results for the 0.5B model fine-tuned on BoolQ and CommonsenseQA, respectively. For each task, we estimated ∆⋆ and computed its singular values per…

**Figure 6.** Effect of label noise in LLMs fine-tuning. We fine-tuned Qwen2.5 models using LoRA with various ranks across different levels of label noise. For each configuration, we trained 3 times using random seeds. Panels (a) and (b) show the mean and the range of the results for the 1.5B model fine-tuned on BoolQ and CommonsenseQA, respectively. Here, in the strong-noise regime, LoRA outperforms FFT as predicted by…

**Figure 7.** Effect of sample size in LLMs fine-tuning. We fine-tuned Qwen2.5 models using LoRA with various ranks across different sample sizes. For each configuration, we trained 3 times using random seeds. The top of Panels (a) and (b) shows the mean and the range of the results for the 1.5B model fine-tuned on BoolQ and CommonsenseQA, respectively. For each task, we estimated ∆⋆ and computed its singular values per…

**Figure 8.** Effect of label noise in LLMs fine-tuning. We fine-tuned Qwen2.5 models using LoRA with various ranks across different levels of label noise. For each configuration, we trained 3 times using random seeds. Panels (a) and (b) show the mean and the range of the results for the 3B model fine-tuned on BoolQ and CommonsenseQA, respectively. Here, in the strong-noise regime, LoRA outperforms FFT as predicted by o…

**Figure 9.** Effect of sample size in LLMs fine-tuning. We fine-tuned Qwen2.5 models using LoRA with various ranks across different sample sizes. For each configuration, we trained 3 times using random seeds. The top of Panels (a) and (b) shows the mean and the range of the results for the 3B model fine-tuned on BoolQ and CommonsenseQA, respectively. For each task, we estimated ∆⋆ and computed its singular values per l…

**Figure 10.** Linear regression experiments. Panel (a) presents the excess risk of FFT and LoRA under varying sample size n (decreasing from left to right), fixed dimensions dx = dy = 100, noise magnitude σ = 1, and true task-shift rank rank(∆⋆) = 4. Panel (b) plots the excess risk as a function of noise level σ ∈ [1, 100] with dx = dy = 100, n = 1000, and rank(∆⋆) = 10. Results are averaged over 100 random seeds; sh…

**Figure 11.** This figure illustrates the role of the central quantity

**Figure 12.** Effect of spectral decay on FFT vs. LoRA. Excess risk as a function of the singular value decay rate λ for a full-rank ∆⋆ with σi = 5 exp(−λi). A flat spectrum (λ = 0) favors FFT, while increasing spectral concentration leads to superior performance of LoRA with moderate rank. Settings: dx = dy = 40, n = 200, σε = 0.5, averaged over 100 runs.
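The setting in the Figure 12 caption can be approximated with a short simulation. The sketch below builds a full-rank ∆⋆ with singular values 5 exp(−λi) and compares the two estimators at a flat and a steep spectrum. The index range i = 1..d, the random orthogonal factors, the isotropic inputs, the LoRA rank of 4, and the use of 5 seeds instead of 100 runs are all assumptions; the names `decay_trial` and `mean_risks` are hypothetical.

```python
import numpy as np

def decay_trial(lam, lora_rank=4, d=40, n=200, noise=0.5, seed=0):
    """Return (FFT excess risk, LoRA excess risk) for decay rate `lam`."""
    rng = np.random.default_rng(seed)
    # Random orthogonal singular vectors (an assumption; the caption does not specify them).
    U, _ = np.linalg.qr(rng.standard_normal((d, d)))
    V, _ = np.linalg.qr(rng.standard_normal((d, d)))
    spectrum = 5.0 * np.exp(-lam * np.arange(1, d + 1))   # sigma_i = 5 exp(-lam * i)
    delta_star = (U * spectrum) @ V.T                     # U @ diag(spectrum) @ V.T
    X = rng.standard_normal((n, d))
    R = X @ delta_star + noise * rng.standard_normal((n, d))
    # Full fine-tuning stand-in: unconstrained least squares.
    delta_fft, *_ = np.linalg.lstsq(X, R, rcond=None)
    # Idealized LoRA stand-in: rank-constrained least squares via reduced-rank regression.
    _, _, Vt = np.linalg.svd(X @ delta_fft, full_matrices=False)
    P = Vt[:lora_rank].T @ Vt[:lora_rank]
    delta_lora = delta_fft @ P
    risk = lambda m: float(np.sum((m - delta_star) ** 2))
    return risk(delta_fft), risk(delta_lora)

def mean_risks(lam, seeds=range(5)):
    """Average (FFT, LoRA) excess risk over a few seeds."""
    return np.mean([decay_trial(lam, seed=s) for s in seeds], axis=0)

flat = mean_risks(0.0)    # all singular values equal to 5
steep = mean_risks(1.0)   # energy concentrated in the first few directions
print("flat spectrum  (FFT, LoRA):", flat)
print("steep spectrum (FFT, LoRA):", steep)
```

With a flat spectrum the rank constraint discards most of the signal, so the unconstrained estimator should come out ahead; with a steep spectrum the discarded directions carry little signal and mostly noise, which matches the direction of the effect stated in the caption.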

Original abstract

Fine-tuning adapts a pre-trained model to downstream tasks using a small amount of labeled data. Low-Rank Adaptation (LoRA) is an efficient fine-tuning method that reduces memory and computation costs while often achieving performance close to full fine-tuning. Despite its widespread use, the theoretical behavior of LoRA is not yet well understood. In this paper, we study LoRA in a simple linear regression setting and compare its excess risk with that of full fine-tuning. Our analysis identifies regimes in which LoRA achieves lower excess risk than full fine-tuning in both overdetermined and underdetermined settings. Specifically, our theory predicts that LoRA can outperform full fine-tuning when the difference between the pretraining and the downstream tasks is effectively low-rank. We further show how the choice of LoRA rank affects generalization performance, explaining why using a very small rank can improve test accuracy in certain settings, even though it limits model expressivity. Finally, we support our theoretical results with experiments on practical tasks, suggesting that the identified tradeoffs and insights extend beyond linear regression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper derives excess-risk expressions showing LoRA can beat full fine-tuning in linear regression when the pretrain-downstream weight difference is low-rank, but the advantage may not survive without covariance-subspace alignment.

This paper sets up linear regression to compare excess risk of LoRA against full fine-tuning and identifies regimes where LoRA wins when the task difference is low-rank. It covers both overdetermined and underdetermined cases and explains why a small rank can sometimes improve test performance by cutting variance more than it adds bias. That framing is new enough to be worth looking at and gives a concrete handle on the rank choice that practitioners already tune by hand.

The derivations are explicit and the low-rank condition on the weight difference is stated up front, which makes the predictions falsifiable in principle. Experiments on practical tasks are included to suggest the pattern carries over, though they stay suggestive rather than definitive.

The main soft spot is exactly the one the stress-test flags. The risk expressions separate cleanly only if the sample covariance lines up with the column space of the low-rank difference; otherwise LoRA's projection step can introduce extra bias that full fine-tuning avoids. The paper states the low-rank assumption but does not appear to bound the operator-norm distance between covariance and subspace or to carry the resulting additive error through the comparison. That leaves the claimed regimes narrower than they read at first. The linear setting is a reasonable place to start, but without that robustness check the practical takeaway stays conditional.

This work is aimed at people who want theoretical guidance on when parameter-efficient methods are preferable rather than just cheaper. A reader who already follows generalization bounds for adaptation will get the most out of the explicit tradeoffs. It deserves a serious referee because the closed forms and the rank effect are checkable and the low-rank modeling choice is stated plainly enough to be debated or strengthened.

Referee Report

2 major / 2 minor

Summary. The paper analyzes LoRA versus full fine-tuning in a linear regression setting, deriving closed-form excess-risk expressions for both methods in overdetermined and underdetermined regimes. It claims that LoRA achieves lower excess risk than full fine-tuning when the difference between pretraining and downstream tasks is low-rank, examines how LoRA rank affects generalization, and supports the theory with experiments on practical tasks.

Significance. If the derivations hold, the work supplies a concrete theoretical account of when and why parameter-efficient methods can outperform full fine-tuning, including an explanation for the benefit of small ranks in certain regimes. The identification of explicit low-rank conditions and the accompanying risk formulas constitute a useful step toward understanding fine-tuning trade-offs beyond empirical observation.

major comments (2)

[Section 3] Section 3 (main theoretical derivations): the excess-risk comparison between LoRA and full fine-tuning is derived under the assumption that the task difference Δ is exactly (or effectively) rank-r. The risk expressions separate cleanly only when the sample covariance aligns with the column space of Δ; the manuscript does not bound the operator-norm distance between the covariance and this subspace or quantify the resulting additive bias term that LoRA would incur. This omission is load-bearing for the central claim that LoRA outperforms full fine-tuning in the stated regimes.
[Theorem statements] Theorem statements (around the over- and under-determined cases): the low-rank modeling choice for Δ is presented as sufficient for the superiority result, yet no independent verification or sensitivity analysis is provided to show that the assumption is not chosen post-hoc to match the desired regime. Without such checks, the predicted advantage remains conditional on an untested modeling premise.

minor comments (2)

[Notation] The notation for the pretraining weights and the downstream target could be introduced earlier and used consistently when defining excess risk.
[Experiments] In the experimental section, it would help to report how closely the real-task weight differences approximate the low-rank assumption (e.g., via singular-value spectra).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The feedback identifies key assumptions in our theoretical analysis that warrant further clarification and strengthening. We respond to each major comment below, indicating planned revisions where appropriate.

Referee: [Section 3] Section 3 (main theoretical derivations): the excess-risk comparison between LoRA and full fine-tuning is derived under the assumption that the task difference Δ is exactly (or effectively) rank-r. The risk expressions separate cleanly only when the sample covariance aligns with the column space of Δ; the manuscript does not bound the operator-norm distance between the covariance and this subspace or quantify the resulting additive bias term that LoRA would incur. This omission is load-bearing for the central claim that LoRA outperforms full fine-tuning in the stated regimes.

Authors: We agree that the clean separation of risk expressions in Section 3 relies on alignment between the sample covariance and the column space of Δ. Our analysis isolates the low-rank effect under this condition, which is a standard modeling choice to derive explicit comparisons. To strengthen the result, we will revise the section to include a bound on the operator-norm distance between the covariance and the subspace, along with a quantification of the resulting additive bias in the excess-risk difference. This addition will demonstrate that LoRA retains an advantage under bounded misalignment, consistent with practical feature distributions.

revision: yes

Referee: [Theorem statements] Theorem statements (around the over- and under-determined cases): the low-rank modeling choice for Δ is presented as sufficient for the superiority result, yet no independent verification or sensitivity analysis is provided to show that the assumption is not chosen post-hoc to match the desired regime. Without such checks, the predicted advantage remains conditional on an untested modeling premise.

Authors: The low-rank modeling of Δ is motivated by the structure of task differences observed in transfer learning and is explicitly stated as a condition in the theorems rather than presented as always true. Our experiments on practical tasks provide supporting evidence that effective rank is often low. To directly address the concern, we will add a sensitivity analysis consisting of additional simulations that vary the rank of Δ and report the resulting excess-risk comparisons, confirming that the predicted advantage is observed primarily in the low-rank regime.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained under explicit assumptions

The paper sets up a linear regression model with pretraining and downstream tasks, assumes the task difference Δ is low-rank (or effectively so), and derives closed-form excess risk expressions for full fine-tuning versus LoRA under that condition. The low-rank property is stated as the modeling choice that identifies the outperforming regime rather than being derived from or defined in terms of the risk expressions themselves. No equations reduce the claimed predictions to fitted parameters or prior self-citations by construction. The standard linear-algebra steps for excess risk (involving covariance and projection) are used without an unstated assumption that the covariance commutes with the subspace, an assumption that would collapse the result. Experiments provide separate empirical support. The derivation chain is therefore not circular: its conclusions are not built into its inputs.
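Whether a sample covariance commutes with the subspace of a given task shift can be checked directly. The diagnostic below is an illustrative reading of that condition, not a quantity defined in the paper: it measures the operator norm of the commutator between the sample covariance and the orthogonal projector onto the column space of Δ. The name `alignment_gap` and the convention that Δ has shape (d_x, d_y) are assumptions.

```python
import numpy as np

def alignment_gap(X, delta, rank):
    """Operator norm of Sigma_hat @ P - P @ Sigma_hat, where P projects onto the
    span of the top-`rank` left singular vectors of `delta`. It is zero exactly
    when the sample covariance commutes with that projector."""
    n = X.shape[0]
    sigma_hat = X.T @ X / n
    U, _, _ = np.linalg.svd(delta, full_matrices=False)
    P = U[:, :rank] @ U[:, :rank].T
    return float(np.linalg.norm(sigma_hat @ P - P @ sigma_hat, ord=2))

rng = np.random.default_rng(0)
delta = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))   # rank-2 shift
Q, _ = np.linalg.qr(rng.standard_normal((50, 6)))                   # orthonormal columns
X_white = np.sqrt(50) * Q                    # sample covariance is exactly the identity
X_skewed = X_white * np.arange(1, 7)         # per-feature scaling breaks the alignment
print(alignment_gap(X_white, delta, 2), alignment_gap(X_skewed, delta, 2))
```

A gap near zero corresponds to the aligned case in which the risk expressions are said to separate cleanly; a large gap flags inputs for which that separation should not be taken for granted.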

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the linear regression model and the low-rank task-difference assumption; no explicit free parameters or invented entities are described in the abstract.

free parameters (1)

LoRA rank r
The rank is a modeling choice that controls expressivity and is shown to affect generalization performance.

axioms (1)

domain assumption: The difference between pretraining and downstream tasks is effectively low-rank
This premise is invoked to identify the regimes where LoRA achieves lower excess risk.

pith-pipeline@v0.9.0 · 5716 in / 1236 out tokens · 37393 ms · 2026-05-20T12:10:17.763376+00:00 · methodology

[1] [1]

MIT Press, 2024

Francis Bach.Learning Theory from First Principles. MIT Press, 2024

work page 2024

[2] [2]

Benign overfitting in linear regression.Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020

Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression.Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020

work page 2020

[3] [3]

LoRA learns less and forgets less.Transactions on Machine Learning Research, 2024

Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, and John Patrick Cunningham. LoRA learns less and forgets less.Transactions on Machine Learning Research, 2024. ISSN 2835-8856

work page 2024

[4] [4]

Oxford University Press, Oxford, UK, 1 edition, 2013

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford, UK, 1 edition, 2013. ISBN 978-0-19-953525-5

work page 2013

[5] [5]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems. Curran Associates, Inc., 2020

work page 2020

[6] [6]

PaLM: Scaling language modeling with pathways.Journal of Machine Learning Research, 24(240): 1–113, 2023

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways.Journal of Machine Learning Research, 24(240): 1–113, 2023

work page 2023

[7] [7]

BoolQ: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long and Short Papers)....

work page 2019

[8] [8]

QLoRA: Efficient finetuning of quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. InAdvances in Neural Information Processing Systems. Curran Associates, Inc., 2023

work page 2023

[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long and Short Papers). Association for Computat...

[10] Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305, 2020.

[11] Shmuel Friedland and Anatoli Torokhti. Generalized rank-constrained matrix approximations. SIAM Journal on Matrix Analysis and Applications, 29(2):656–659, 2007.

[12] Christophe Giraud. Introduction to High-Dimensional Statistics. Chapman and Hall/CRC, 2021.

[13] Yehoram Gordon. Some inequalities for Gaussian processes and applications. Israel Journal of Mathematics, 50(4):265–289, 1985.

[14] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. Annals of Statistics, 50(2):949–986, 2022.

[15] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, 2022.

[16] Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A Smith, Iz Beltagy, et al. Camels in a changing climate: Enhancing LM adaptation with Tulu 2. arXiv preprint arXiv:2311.10702, 2023.

[17] Junsu Kim, Jaeyeon Kim, and Ernest K Ryu. LoRA training provably converges to a low-rank global minimum or it fails loudly (but it probably won’t fail). In Proceedings of the 42nd International Conference on Machine Learning. PMLR, 2025.

[18] Anastasis Kratsios, Tin Sum Cheng, Aurelien Lucchi, and Haitz Sáez de Ocáriz Borde. Sharp generalization bounds for foundation models with asymmetric randomized low-rank adapters. arXiv preprint arXiv:2506.14530, 2025.

[19] Philippe Rigollet and Jan-Christian Hütter. High-dimensional statistics. arXiv preprint arXiv:2310.19244, 2023.

[20] Reece Shuttleworth, Jacob Andreas, Antonio Torralba, and Pratyusha Sharma. LoRA vs full fine-tuning: An illusion of equivalence. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2025.

[21] Jack W Silverstein. The smallest eigenvalue of a large dimensional Wishart matrix. The Annals of Probability, 13(4):1364–1368, 1985.

[22] Dieter Sondermann. Best approximate solutions to matrix equations under rank restrictions. Statistische Hefte, 27(1):57–66, 1986.

[23] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long and Short Papers). Association for Computationa...

[24] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[25] Madeleine Udell and Alex Townsend. Why are big data matrices approximately low rank? SIAM Journal on Mathematics of Data Science, 1(1):144–160, 2019.

[26] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi T... Qwen2.5 technical report, 2025.

[27] Yuchen Zeng and Kangwook Lee. The expressive power of low-rank adaptation. In The Twelfth International Conference on Learning Representations, 2024.

[28] Yuanhe Zhang, Fanghui Liu, and Yudong Chen. LoRA-one: One-step full gradient could suffice for fine-tuning large language models, provably and efficiently. In Proceedings of the 42nd International Conference on Machine Learning. PMLR, 2025.

[29] Terry Yue Zhuo, Armel Zebaze, Nitchakarn Suppattarachai, Leandro von Werra, Harm de Vries, Qian Liu, and Niklas Muennighoff. Astraios: Parameter-efficient instruction tuning code large language models. arXiv preprint arXiv:2401.00788, 2024.

[30] Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

Appendix I Summary of Notation

Table 1: Summary of notation used in the paper

Symbol Definition Mathematical definition / short note
xi,x ...