High-Dimensional Theory of LoRA Fine-Tuning in a Solvable Attention Model

F. Boncoraglio; L. Zdeborov\'a; O. Duranthon

arxiv: 2606.05899 · v1 · pith:2QS4AU4Gnew · submitted 2026-06-04 · 💻 cs.LG · cond-mat.dis-nn

High-Dimensional Theory of LoRA Fine-Tuning in a Solvable Attention Model

O. Duranthon , F. Boncoraglio , L. Zdeborov\'a This is my paper

Pith reviewed 2026-06-28 02:12 UTC · model grok-4.3

classification 💻 cs.LG cond-mat.dis-nn

keywords LoRA fine-tuningattention modelshigh-dimensional asymptoticspre-trainingorder parameterstest errorrepresentation alignmentactive learning

0 comments

The pith

Pre-training affects LoRA fine-tuning only through an effective noise term whose strength can be minimized by choice of pre-training task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a solvable single-head attention model that is first pre-trained on abundant data and then adapted with a rank-one LoRA update on scarce data. In the high-dimensional limit both stages are tracked exactly by a closed set of order parameters that yield explicit formulas for test error and representation alignment. The central result is that every detail of the pre-training stage collapses into a single effective noise variance felt by the LoRA step; minimizing that variance supplies concrete rules for how to pre-train. A secondary finding is a regime in which test error and representation quality diverge, and the framework is applied to active selection of fine-tuning examples.

Core claim

In the solvable attention model, the influence of pre-training on subsequent rank-one LoRA adaptation is fully captured by an effective noise term; the optimal pre-training procedure is the one that minimizes this term, and the same order-parameter equations also reveal a mismatch between test error and representation alignment under certain data regimes.

What carries the argument

The finite set of order parameters that give the exact high-dimensional asymptotics of both the pre-training and the rank-one LoRA fine-tuning stages.

If this is right

The optimal pre-training task is the one that produces the smallest effective noise variance for the downstream LoRA step.
Representation alignment and test error can be controlled independently by choice of pre-training data distribution.
Active fine-tuning can be performed by selecting examples that most reduce the effective noise felt by the LoRA update.
The same order-parameter analysis supplies closed-form predictions for test error after any rank-one LoRA update.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reduction to effective noise may extend to other low-rank adaptation schemes beyond rank-one LoRA.
The mismatch between error and representation quality suggests that representation probes alone may not reliably predict downstream performance after adaptation.
If the order-parameter equations remain tractable for multi-head or deeper attention stacks, the same noise-reduction principle could guide pre-training of full transformers.

Load-bearing premise

Both the pre-training and the fine-tuning stages admit a sharp asymptotic characterization in terms of a finite set of order parameters in the high-dimensional limit.

What would settle it

Numerical simulation of the solvable attention model at increasing dimension should show the predicted test error and alignment curves converging to the order-parameter formulas; any persistent deviation at large dimension would falsify the reduction to effective noise.

Figures

Figures reproduced from arXiv: 2606.05899 by F. Boncoraglio, L. Zdeborov\'a, O. Duranthon.

**Figure 1.** Figure 1: Fine-tuning test error E ′ versus pre-training regularization λ for λ ′ ∈ {0.05, 1, ∞}, at α = 0.1, α ′ = 3, T = 3, κ0 = κ = 1, and noise levels ∆ = 0.5 (left) and ∆ = 2 (right). Insets show E ′ versus λ ′ at the optimal λ. Solid lines are state-evolution predictions; square markers are D = 150 simulations averaged over four runs. Vertical dark-red dashed lines mark curve minima, and the black dashed line … view at source ↗

**Figure 2.** Figure 2: Effect of pre-training on the fine-tuning task. Fine-tuning test error [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Fine-tuning input samples form the pre-training set [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Active fine-tuning. Left: fine-tuning test errors [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗

**Figure 5.** Figure 5: Effect of pre-training on the fine-tuning task for the rescaled identity activation [PITH_FULL_IMAGE:figures/full_fig_p024_5.png] view at source ↗

**Figure 6.** Figure 6: Recalibration of the frozen extensive-rank attention map improves downstream [PITH_FULL_IMAGE:figures/full_fig_p025_6.png] view at source ↗

read the original abstract

We develop a high-dimensional statistical theory of low-rank adaptation (LoRA) in attention models, capturing the interplay between pre-training and fine-tuning. We introduce a solvable framework in which a single-head attention layer is first pre-trained on a data-abundant task and subsequently adapted via a rank-one LoRA update on limited data. In the high-dimensional limit, both stages admit a sharp asymptotic characterization in terms of a finite set of order parameters, yielding explicit predictions for test errors and representation alignment. Our analysis shows that the impact of pre-training on LoRA is summarized by an effective noise term, from which we derive prescriptions for the optimal pre-training procedure. We also demonstrate a regime with a mismatch between the value of the test error and representation quality, and propose an application of our theory to active fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper sets up a solvable single-head attention model where pre-training effects on rank-one LoRA reduce to an effective noise term, giving explicit predictions but staying inside a stylized high-dimensional limit.

read the letter

The punchline is that pre-training and LoRA fine-tuning are analyzed together in one mean-field framework, and the pre-training stage collapses into a single effective noise parameter that then controls the fine-tuning error. That reduction is the clearest new element.

The work does a clean job deriving order parameters for both stages and using them to get closed-form expressions for test error and representation alignment. The observation that test error and representation quality can decouple is a useful side result, and the prescriptions for pre-training follow directly from the noise term. The math stays internally consistent with standard high-dimensional analyses of attention.

The main limitation is the narrow scope: everything is derived for a single-head attention layer in the high-dimensional limit with specific data assumptions. How much of the effective-noise picture carries over to multi-head transformers or real data distributions is not addressed. The abstract gives no sign of simulation checks against the asymptotic predictions, so the practical accuracy of the formulas remains untested in the provided description.

This is aimed at people who follow theoretical work on fine-tuning and mean-field methods for transformers. A reader already comfortable with order-parameter techniques will find the framework straightforward to engage with. The central claims are precise enough and the setup original enough that it should go to referees rather than get desk-rejected.

Referee Report

0 major / 2 minor

Summary. The paper develops a high-dimensional statistical theory of LoRA fine-tuning in attention models using a solvable single-head attention framework. A model is first pre-trained on abundant data and then adapted via rank-one LoRA on limited data. In the high-dimensional limit, both stages receive sharp asymptotic characterizations in terms of a finite set of order parameters, yielding explicit predictions for test errors and representation alignment. The impact of pre-training is summarized by an effective noise term, from which prescriptions for the optimal pre-training procedure are derived. The work also identifies a regime with mismatch between test error and representation quality and proposes an application to active fine-tuning.

Significance. If the asymptotic characterizations and order-parameter analysis hold, the paper supplies a rigorous mean-field-style framework for the interplay between pre-training and LoRA adaptation. The reduction of pre-training effects to an effective noise term is a potentially useful simplification that could guide practical choices in data-efficient fine-tuning. The mismatch regime and active-learning application add concrete implications beyond the core theory.

minor comments (2)

[Abstract] The abstract states that both stages 'admit a sharp asymptotic characterization' but does not preview the explicit form of the order-parameter equations or the effective noise term; adding one sentence with the key expressions would improve readability for readers scanning the abstract.
Notation for the order parameters (e.g., how the effective noise is defined from the pre-training stage) should be introduced with a clear table or list early in the main text to avoid repeated cross-references.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report correctly captures the core contributions of the solvable attention model, the effective noise term, the mismatch regime, and the active fine-tuning application. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper develops a high-dimensional asymptotic analysis of a solvable single-head attention model, characterizing both pre-training and rank-one LoRA fine-tuning via a finite set of order parameters that yield explicit test-error and alignment predictions. The effective noise term summarizing pre-training impact is presented as an output of this two-stage mean-field calculation rather than an input assumption or fitted quantity. No quoted equations or self-citation chains reduce the central claims to tautological redefinitions or statistically forced predictions; the framework is internally consistent with standard high-dimensional statistical mechanics approaches and remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are identifiable; the theory relies on standard high-dimensional statistical-physics assumptions (thermodynamic limit, concentration of order parameters) that are not detailed here.

pith-pipeline@v0.9.1-grok · 5678 in / 1165 out tokens · 27726 ms · 2026-06-28T02:12:35.654300+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 5 linked inside Pith

[1]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. 11

2022
[2]

Parameter-efficient transfer learning for NLP

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Larous- silhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. InInternational Conference on Machine Learning, pages 2790–2799, 2019

2019
[3]

Prefix-tuning: Optimizing continuous prompts for generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 4582– 4597, 2021

2021
[4]

QLoRA: Efficient finetuning of quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. InAdvances in Neural Information Processing Systems, volume 36, pages 10088–10115, 2023

2023
[5]

Intrinsic dimensionality explains the effectiveness of language model fine-tuning

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 7319–7328, 2021

2021
[6]

The expressive power of low-rank adaptation

Yuchen Zeng and Kangwook Lee. The expressive power of low-rank adaptation. In International Conference on Learning Representations, 2024. arXiv:2310.17513

arXiv 2024
[7]

LoRA-one: One-step full gradient could suffice for fine-tuning large language models, provably and efficiently

Yuanhe Zhang, Fanghui Liu, and Yudong Chen. LoRA-one: One-step full gradient could suffice for fine-tuning large language models, provably and efficiently. InProceedings of the 42nd International Conference on Machine Learning, volume 267, pages 75513–75574. PMLR, 2025. arXiv:2502.01235

arXiv 2025
[8]

Junsu Kim, Jaeyeon Kim, and Ernest K. Ryu. LoRA training provably converges to a low-rank global minimum or it fails loudly (But it probably won’t fail). InProceedings of the 42nd International Conference on Machine Learning, volume 267, pages 30224–30247. PMLR, 2025. arXiv:2502.09376

arXiv 2025
[9]

Understanding the learning dynamics of lora: A gradient flow perspective on low-rank adaptation in matrix factorization

Ziqing Xu, Hancheng Min, Lachlan Ewen MacDonald, Jinqi Luo, Salma Tarmoun, Enrique Mallada, and Rene Vidal. Understanding the learning dynamics of lora: A gradient flow perspective on low-rank adaptation in matrix factorization. InProceedings of The 28th International Conference on Artificial Intelligence and Statistics, volume 258, pages 4636–4644. PMLR,...

arXiv 2025
[10]

Zi Liang, Haibo Hu, Qingqing Ye, Yaxin Xiao, and Ronghua Li. Does low rank adaptation lead to lower robustness against training-time attacks? InProceedings of the 42nd Inter- national Conference on Machine Learning, volume 267, pages 37181–37207. PMLR, 2025. arXiv:2505.12871

arXiv 2025
[11]

On the convergence rate of lora gradient descent, 2025

Siqiao Mu and Diego Klabjan. On the convergence rate of lora gradient descent, 2025. arXiv preprint arXiv:2512.18248

Pith/arXiv arXiv 2025
[12]

When pre-training hurts LoRA fine-tuning: A dynamical analysis via single-index models, 2026

Gibbs Nwemadji, Bruno Loureiro, and Jean Barbier. When pre-training hurts LoRA fine-tuning: A dynamical analysis via single-index models, 2026. arXiv:2602.02855

Pith/arXiv arXiv 2026
[13]

Why LoRA resists label noise: A theoretical framework for noise-robust parameter-efficient fine-tuning, 2026

Brady Steele. Why LoRA resists label noise: A theoretical framework for noise-robust parameter-efficient fine-tuning, 2026. arXiv:2602.00084

arXiv 2026
[14]

Sharp generalization bounds for foundation models with asymmetric randomized low-rank adapters, 2025

Anastasis Kratsios, Tin Sum Cheng, Aurelien Lucchi, and Haitz Sáez de Ocáriz Borde. Sharp generalization bounds for foundation models with asymmetric randomized low-rank adapters, 2025. arXiv:2506.14530. 12

arXiv 2025
[15]

Lee, and Ernest K

Uijeong Jang, Jason D. Lee, and Ernest K. Ryu. LoRA training in the NTK regime has no spurious local minima. InProceedings of the 41st International Conference on Machine Learning, volume 235, pages 21306–21328. PMLR, 2024. arXiv:2402.11867

arXiv 2024
[16]

Bayes optimal learning of attention-indexed models

Fabrizio Boncoraglio, Emanuele Troiani, Vittorio Erba, and Lenka Zdeborová. Bayes optimal learning of attention-indexed models. InAdvances in Neural Information Processing Systems, volume 38, pages 105029–105074, 2025. arXiv:2506.01582

arXiv 2025
[17]

Single-head attention in high dimensions: A theory of generalization, weights spectra, and scaling laws

Fabrizio Boncoraglio, Vittorio Erba, Emanuele Troiani, Yizhou Xu, Florent Krzakala, and Lenka Zdeborová. Single-head attention in high dimensions: A theory of generalization, weights spectra, and scaling laws. InInternational Conference on Machine Learning, 2026. arXiv:2509.24914

Pith/arXiv arXiv 2026
[18]

Online stochastic gradient descent on non-convex losses from high-dimensional inference.Journal of Machine Learning Research, 22(106):1–51, 2021

Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Online stochastic gradient descent on non-convex losses from high-dimensional inference.Journal of Machine Learning Research, 22(106):1–51, 2021

2021
[19]

Alex Damian, Eshaan Nichani, Rong Ge, and Jason D. Lee. Smoothing the landscape boosts the signal for SGD: Optimal sample complexity for learning single index models. In Advances in Neural Information Processing Systems, volume 36, pages 752–784, 2023

2023
[20]

From high- dimensional and mean-field dynamics to dimensionless odes: A unifying approach to SGD in two-layer networks

Luca Arnaboldi, Ludovic Stephan, Florent Krzakala, and Bruno Loureiro. From high- dimensional and mean-field dynamics to dimensionless odes: A unifying approach to SGD in two-layer networks. InThe Thirty Sixth Annual Conference on Learning Theory, pages 1199–1227. PMLR, 2023

2023
[21]

Learning time-scales in two-layers neural networks.Foundations of Computational Mathematics, 25(5):1627–1710, 2025

Raphaël Berthier, Andrea Montanari, and Kangjie Zhou. Learning time-scales in two-layers neural networks.Foundations of Computational Mathematics, 25(5):1627–1710, 2025

2025
[22]

High-dimensional learning of narrow neural networks.Journal of Statistical Mechanics: Theory and Experiment, 2025(2):023402, 2025

Hugo Cui. High-dimensional learning of narrow neural networks.Journal of Statistical Mechanics: Theory and Experiment, 2025(2):023402, 2025

2025
[23]

A phase transition between positional and semantic learning in a solvable model of dot-product attention

Hugo Cui, Freya Behrens, Florent Krzakala, and Lenka Zdeborová. A phase transition between positional and semantic learning in a solvable model of dot-product attention. Advances in Neural Information Processing Systems, 37:36342–36389, 2024

2024
[24]

Fundamental limits of learning in sequence multi-index models and deep attention networks: high-dimensional asymptotics and sharp thresholds

Emanuele Troiani, Hugo Chao Cui, Yatin Dandi, Florent Krzakala, and Lenka Zdeborová. Fundamental limits of learning in sequence multi-index models and deep attention networks: high-dimensional asymptotics and sharp thresholds. InProceedings of the 42nd International Conference on Machine Learning, 2025

2025
[25]

Asymptotics of sgd in sequence-single index models and single-layer attention networks

Luca Arnaboldi, Bruno Loureiro, Ludovic Stephan, Florent Krzakala, and Lenka Zdeborová. Asymptotics of sgd in sequence-single index models and single-layer attention networks. In Advances in Neural Information Processing Systems, volume 38, pages 20611–20645, 2025

2025
[26]

Duranthon, P

O. Duranthon, P. Marion, C. Boyer, B. Loureiro, and L. Zdeborová. Statistical advantage of softmax attention: Insights from single-location regression. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026. arXiv:2509.21936

arXiv 2026
[27]

Andrea Montanari and Basil N. Saeed. Universality of empirical risk minimization. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 ofProceedings of Machine Learning Research, pages 4310–4312. PMLR, 2022

2022
[28]

Universality laws for gaussian mixtures in generalized linear models

Yatin Dandi, Ludovic Stephan, Florent Krzakala, Bruno Loureiro, and Lenka Zdeborová. Universality laws for gaussian mixtures in generalized linear models. InAdvances in Neural Information Processing Systems, volume 36, pages 54754–54768, 2023. 13

2023
[29]

Lu, and Subhabrata Sen

Rishabh Dudeja, Yue M. Lu, and Subhabrata Sen. Universality of approximate message passing with semirandom matrices.The Annals of Probability, 51(5):1616–1683, 2023

2023
[30]

Lu and Horng-Tzer Yau

Yue M. Lu and Horng-Tzer Yau. An equivalence principle for the spectrum of random inner-product kernel matrices with polynomial scalings.The Annals of Applied Probability, 35(4):2411–2470, 2025

2025
[31]

Asymptotics of non-convex generalized linear models in high dimensions: A proof of the replica formula,

Matteo Vilucchio, Yatin Dandi, Cedric Gerbelot, and Florent Krzakala. Asymptotics of non-convex generalized linear models in high dimensions: A proof of the replica formula,
[32]

Topological trivialization in non-convex empirical risk minimization, 2026

Andrea Montanari and Basil Saeed. Topological trivialization in non-convex empirical risk minimization, 2026. arxiv:2602.14969

arXiv 2026
[33]

A rank stabilization scaling factor for fine-tuning with lora, 2023

Damjan Kalajdzievski. A rank stabilization scaling factor for fine-tuning with lora, 2023. arXiv:2312.03732

Pith/arXiv arXiv 2023
[34]

Decoupling angles and strength in low-rank adaptation

Massimo Bini, Leander Girrbach, and Zeynep Akata. Decoupling angles and strength in low-rank adaptation. InInternational Conference on Learning Representations, volume 2025, pages 20216–20233, 2025

2025
[35]

A solvable high-dimensional model where nonlinear autoencoders learn structure invisible to PCA while test loss misaligns with generalization

Vicente Conde Mendes, Lorenzo Bardone, Cédric Koller, Jorge Medina Moreira, Vittorio Erba, Emanuele Troiani, and Lenka Zdeborová. A solvable high-dimensional model where nonlinear autoencoders learn structure invisible to PCA while test loss misaligns with generalization. InInternational Conference on Machine Learning, 2026. arXiv:2602.10680

Pith/arXiv arXiv 2026
[36]

Biased generalization in diffusion models, 2026

Jerome Garnier-Brun, Luca Biggio, Davide Beltrame, Marc Mézard, and Luca Saglietti. Biased generalization in diffusion models, 2026. arXiv:2603.03469

arXiv 2026
[37]

Fundamental limits of matrix sensing: Exact asymptotics, universality, and applications

Yizhou Xu, Antoine Maillard, Lenka Zdeborová, and Florent Krzakala. Fundamental limits of matrix sensing: Exact asymptotics, universality, and applications. InConference on Learning Theory, 2025

2025
[38]

without active fine-tuning

Vittorio Erba, Emanuele Troiani, Lenka Zdeborová, and Florent Krzakala. The nuclear route: Sharp asymptotics of erm in overparameterized quadratic networks. InAdvances in Neural Information Processing Systems, volume 38, pages 88862–88901, 2025. 14 A Derivation of the main results In this section we provide the derivation of the results on the asymptotic ...

2025

[1] [1]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. 11

2022

[2] [2]

Parameter-efficient transfer learning for NLP

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Larous- silhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. InInternational Conference on Machine Learning, pages 2790–2799, 2019

2019

[3] [3]

Prefix-tuning: Optimizing continuous prompts for generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 4582– 4597, 2021

2021

[4] [4]

QLoRA: Efficient finetuning of quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. InAdvances in Neural Information Processing Systems, volume 36, pages 10088–10115, 2023

2023

[5] [5]

Intrinsic dimensionality explains the effectiveness of language model fine-tuning

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 7319–7328, 2021

2021

[6] [6]

The expressive power of low-rank adaptation

Yuchen Zeng and Kangwook Lee. The expressive power of low-rank adaptation. In International Conference on Learning Representations, 2024. arXiv:2310.17513

arXiv 2024

[7] [7]

LoRA-one: One-step full gradient could suffice for fine-tuning large language models, provably and efficiently

Yuanhe Zhang, Fanghui Liu, and Yudong Chen. LoRA-one: One-step full gradient could suffice for fine-tuning large language models, provably and efficiently. InProceedings of the 42nd International Conference on Machine Learning, volume 267, pages 75513–75574. PMLR, 2025. arXiv:2502.01235

arXiv 2025

[8] [8]

Junsu Kim, Jaeyeon Kim, and Ernest K. Ryu. LoRA training provably converges to a low-rank global minimum or it fails loudly (But it probably won’t fail). InProceedings of the 42nd International Conference on Machine Learning, volume 267, pages 30224–30247. PMLR, 2025. arXiv:2502.09376

arXiv 2025

[9] [9]

Understanding the learning dynamics of lora: A gradient flow perspective on low-rank adaptation in matrix factorization

Ziqing Xu, Hancheng Min, Lachlan Ewen MacDonald, Jinqi Luo, Salma Tarmoun, Enrique Mallada, and Rene Vidal. Understanding the learning dynamics of lora: A gradient flow perspective on low-rank adaptation in matrix factorization. InProceedings of The 28th International Conference on Artificial Intelligence and Statistics, volume 258, pages 4636–4644. PMLR,...

arXiv 2025

[10] [10]

Zi Liang, Haibo Hu, Qingqing Ye, Yaxin Xiao, and Ronghua Li. Does low rank adaptation lead to lower robustness against training-time attacks? InProceedings of the 42nd Inter- national Conference on Machine Learning, volume 267, pages 37181–37207. PMLR, 2025. arXiv:2505.12871

arXiv 2025

[11] [11]

On the convergence rate of lora gradient descent, 2025

Siqiao Mu and Diego Klabjan. On the convergence rate of lora gradient descent, 2025. arXiv preprint arXiv:2512.18248

Pith/arXiv arXiv 2025

[12] [12]

When pre-training hurts LoRA fine-tuning: A dynamical analysis via single-index models, 2026

Gibbs Nwemadji, Bruno Loureiro, and Jean Barbier. When pre-training hurts LoRA fine-tuning: A dynamical analysis via single-index models, 2026. arXiv:2602.02855

Pith/arXiv arXiv 2026

[13] [13]

Why LoRA resists label noise: A theoretical framework for noise-robust parameter-efficient fine-tuning, 2026

Brady Steele. Why LoRA resists label noise: A theoretical framework for noise-robust parameter-efficient fine-tuning, 2026. arXiv:2602.00084

arXiv 2026

[14] [14]

Sharp generalization bounds for foundation models with asymmetric randomized low-rank adapters, 2025

Anastasis Kratsios, Tin Sum Cheng, Aurelien Lucchi, and Haitz Sáez de Ocáriz Borde. Sharp generalization bounds for foundation models with asymmetric randomized low-rank adapters, 2025. arXiv:2506.14530. 12

arXiv 2025

[15] [15]

Lee, and Ernest K

Uijeong Jang, Jason D. Lee, and Ernest K. Ryu. LoRA training in the NTK regime has no spurious local minima. InProceedings of the 41st International Conference on Machine Learning, volume 235, pages 21306–21328. PMLR, 2024. arXiv:2402.11867

arXiv 2024

[16] [16]

Bayes optimal learning of attention-indexed models

Fabrizio Boncoraglio, Emanuele Troiani, Vittorio Erba, and Lenka Zdeborová. Bayes optimal learning of attention-indexed models. InAdvances in Neural Information Processing Systems, volume 38, pages 105029–105074, 2025. arXiv:2506.01582

arXiv 2025

[17] [17]

Single-head attention in high dimensions: A theory of generalization, weights spectra, and scaling laws

Fabrizio Boncoraglio, Vittorio Erba, Emanuele Troiani, Yizhou Xu, Florent Krzakala, and Lenka Zdeborová. Single-head attention in high dimensions: A theory of generalization, weights spectra, and scaling laws. InInternational Conference on Machine Learning, 2026. arXiv:2509.24914

Pith/arXiv arXiv 2026

[18] [18]

Online stochastic gradient descent on non-convex losses from high-dimensional inference.Journal of Machine Learning Research, 22(106):1–51, 2021

Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Online stochastic gradient descent on non-convex losses from high-dimensional inference.Journal of Machine Learning Research, 22(106):1–51, 2021

2021

[19] [19]

Alex Damian, Eshaan Nichani, Rong Ge, and Jason D. Lee. Smoothing the landscape boosts the signal for SGD: Optimal sample complexity for learning single index models. In Advances in Neural Information Processing Systems, volume 36, pages 752–784, 2023

2023

[20] [20]

From high- dimensional and mean-field dynamics to dimensionless odes: A unifying approach to SGD in two-layer networks

Luca Arnaboldi, Ludovic Stephan, Florent Krzakala, and Bruno Loureiro. From high- dimensional and mean-field dynamics to dimensionless odes: A unifying approach to SGD in two-layer networks. InThe Thirty Sixth Annual Conference on Learning Theory, pages 1199–1227. PMLR, 2023

2023

[21] [21]

Learning time-scales in two-layers neural networks.Foundations of Computational Mathematics, 25(5):1627–1710, 2025

Raphaël Berthier, Andrea Montanari, and Kangjie Zhou. Learning time-scales in two-layers neural networks.Foundations of Computational Mathematics, 25(5):1627–1710, 2025

2025

[22] [22]

High-dimensional learning of narrow neural networks.Journal of Statistical Mechanics: Theory and Experiment, 2025(2):023402, 2025

Hugo Cui. High-dimensional learning of narrow neural networks.Journal of Statistical Mechanics: Theory and Experiment, 2025(2):023402, 2025

2025

[23] [23]

A phase transition between positional and semantic learning in a solvable model of dot-product attention

Hugo Cui, Freya Behrens, Florent Krzakala, and Lenka Zdeborová. A phase transition between positional and semantic learning in a solvable model of dot-product attention. Advances in Neural Information Processing Systems, 37:36342–36389, 2024

2024

[24] [24]

Fundamental limits of learning in sequence multi-index models and deep attention networks: high-dimensional asymptotics and sharp thresholds

Emanuele Troiani, Hugo Chao Cui, Yatin Dandi, Florent Krzakala, and Lenka Zdeborová. Fundamental limits of learning in sequence multi-index models and deep attention networks: high-dimensional asymptotics and sharp thresholds. InProceedings of the 42nd International Conference on Machine Learning, 2025

2025

[25] [25]

Asymptotics of sgd in sequence-single index models and single-layer attention networks

Luca Arnaboldi, Bruno Loureiro, Ludovic Stephan, Florent Krzakala, and Lenka Zdeborová. Asymptotics of sgd in sequence-single index models and single-layer attention networks. In Advances in Neural Information Processing Systems, volume 38, pages 20611–20645, 2025

2025

[26] [26]

Duranthon, P

O. Duranthon, P. Marion, C. Boyer, B. Loureiro, and L. Zdeborová. Statistical advantage of softmax attention: Insights from single-location regression. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026. arXiv:2509.21936

arXiv 2026

[27] [27]

Andrea Montanari and Basil N. Saeed. Universality of empirical risk minimization. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 ofProceedings of Machine Learning Research, pages 4310–4312. PMLR, 2022

2022

[28] [28]

Universality laws for gaussian mixtures in generalized linear models

Yatin Dandi, Ludovic Stephan, Florent Krzakala, Bruno Loureiro, and Lenka Zdeborová. Universality laws for gaussian mixtures in generalized linear models. InAdvances in Neural Information Processing Systems, volume 36, pages 54754–54768, 2023. 13

2023

[29] [29]

Lu, and Subhabrata Sen

Rishabh Dudeja, Yue M. Lu, and Subhabrata Sen. Universality of approximate message passing with semirandom matrices.The Annals of Probability, 51(5):1616–1683, 2023

2023

[30] [30]

Lu and Horng-Tzer Yau

Yue M. Lu and Horng-Tzer Yau. An equivalence principle for the spectrum of random inner-product kernel matrices with polynomial scalings.The Annals of Applied Probability, 35(4):2411–2470, 2025

2025

[31] [31]

Asymptotics of non-convex generalized linear models in high dimensions: A proof of the replica formula,

Matteo Vilucchio, Yatin Dandi, Cedric Gerbelot, and Florent Krzakala. Asymptotics of non-convex generalized linear models in high dimensions: A proof of the replica formula,

[32] [32]

Topological trivialization in non-convex empirical risk minimization, 2026

Andrea Montanari and Basil Saeed. Topological trivialization in non-convex empirical risk minimization, 2026. arxiv:2602.14969

arXiv 2026

[33] [33]

A rank stabilization scaling factor for fine-tuning with lora, 2023

Damjan Kalajdzievski. A rank stabilization scaling factor for fine-tuning with lora, 2023. arXiv:2312.03732

Pith/arXiv arXiv 2023

[34] [34]

Decoupling angles and strength in low-rank adaptation

Massimo Bini, Leander Girrbach, and Zeynep Akata. Decoupling angles and strength in low-rank adaptation. InInternational Conference on Learning Representations, volume 2025, pages 20216–20233, 2025

2025

[35] [35]

A solvable high-dimensional model where nonlinear autoencoders learn structure invisible to PCA while test loss misaligns with generalization

Vicente Conde Mendes, Lorenzo Bardone, Cédric Koller, Jorge Medina Moreira, Vittorio Erba, Emanuele Troiani, and Lenka Zdeborová. A solvable high-dimensional model where nonlinear autoencoders learn structure invisible to PCA while test loss misaligns with generalization. InInternational Conference on Machine Learning, 2026. arXiv:2602.10680

Pith/arXiv arXiv 2026

[36] [36]

Biased generalization in diffusion models, 2026

Jerome Garnier-Brun, Luca Biggio, Davide Beltrame, Marc Mézard, and Luca Saglietti. Biased generalization in diffusion models, 2026. arXiv:2603.03469

arXiv 2026

[37] [37]

Fundamental limits of matrix sensing: Exact asymptotics, universality, and applications

Yizhou Xu, Antoine Maillard, Lenka Zdeborová, and Florent Krzakala. Fundamental limits of matrix sensing: Exact asymptotics, universality, and applications. InConference on Learning Theory, 2025

2025

[38] [38]

without active fine-tuning

Vittorio Erba, Emanuele Troiani, Lenka Zdeborová, and Florent Krzakala. The nuclear route: Sharp asymptotics of erm in overparameterized quadratic networks. InAdvances in Neural Information Processing Systems, volume 38, pages 88862–88901, 2025. 14 A Derivation of the main results In this section we provide the derivation of the results on the asymptotic ...

2025