Understanding Catastrophic Forgetting In LoRA via Mean-Field Attention Dynamics

Hugo Koubbi; Louis Hernandez; Matthieu Boussard

arxiv: 2402.15415 · v2 · pith:DAAUN7TYnew · submitted 2024-02-23 · 💻 cs.LG · math.DS· stat.ML

Understanding Catastrophic Forgetting In LoRA via Mean-Field Attention Dynamics

Hugo Koubbi , Louis Hernandez , Matthieu Boussard This is my paper

Pith reviewed 2026-05-24 03:30 UTC · model grok-4.3

classification 💻 cs.LG math.DSstat.ML

keywords LoRAcatastrophic forgettingmean-field self-attentionphase transitiontransformer depthlow-rank perturbationfine-tuningdynamical systems

0 comments

The pith

A mean-field self-attention model identifies phase transitions that separate forgetting from non-forgetting regimes under LoRA fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a tractable mean-field self-attention toy model in which tokens behave as an interacting particle system and LoRA enters as a low-rank perturbation. Using tools from partial differential equations and dynamical systems, the authors locate regimes that suggest a phase transition between forgetting and non-forgetting behavior. One transition depends on the size of the perturbation; another depends on transformer depth. They also derive explicit bounds on the time until the model deviates from its pre-fine-tuning behavior. A reader would care because these transitions and bounds supply concrete conditions under which LoRA preserves or erases earlier knowledge.

Core claim

In the mean-field self-attention toy model, tokens evolve as an interacting particle system and LoRA acts as a low-rank perturbation; regimes exist that suggest a phase transition between forgetting and non-forgetting, one transition appearing with respect to the norm of the perturbation and the other with respect to transformer depth. The time-to-deviation is bounded in terms of perturbation size and spectral quantities, and the predicted trends match experiments on real models.

What carries the argument

The mean-field self-attention toy model, in which tokens evolve as an interacting particle system and LoRA is introduced as a low-rank perturbation.

Load-bearing premise

The tractable mean-field self-attention toy model with tokens as an interacting particle system and LoRA as a low-rank perturbation sufficiently captures the essential mechanisms of catastrophic forgetting in real transformer models under LoRA fine-tuning.

What would settle it

Measure forgetting curves on transformers of increasing depth while varying the Frobenius norm of the LoRA update; check whether the location of the observed transition in forgetting rate matches the phase boundary predicted by the mean-field analysis.

Figures

Figures reproduced from arXiv: 2402.15415 by Hugo Koubbi, Louis Hernandez, Matthieu Boussard.

**Figure 1.** Figure 1: The eigenvalues of V in the pre-trained ALBERT in head 5 and in the right 10. We notice that the Value matrix spectrum on the left has a gap between the largest eigenvalue and the second-largest eigenvalue. However, this is not the case for the second matrix on the right which has a small gap between the largest eigenvalue and the second largest. The theorem (5.4) is relevant for the second matrix, because… view at source ↗

**Figure 2.** Figure 2: Illustration of Theorem 5.4 with d = 2 and n = 20. We have chosen an initialization of the tokens. On the first column, we display the dynamics of the tokens with Q = K = V = I2 and in the second column, the one with Q = K = I2 and V˜ = I2 − εe2e T 2 with ε = 0.01. proof suggested. This might indicate a limitation in our proof technique, which is something we are highlighting for further investigation. We … view at source ↗

**Figure 3.** Figure 3: Phase diagram for the δ-cluster state (with δ = 0.1 and a constant initialization of tokens). The horizontal axis represents the norm of the perturbation ε, and the vertical axis signifies the elapsed time, both on logarithmic scales. The diagram is divided into three distinct phases. In the red phase, no clusters are present. The boundary line, represented by a dotted pattern, indicates the transition po… view at source ↗

**Figure 4.** Figure 4: Illustration of the Theorem 6.2 with d = 2 and n = 20. On the first column, Q = K = I2 and in the second column, Q˜ = K˜ = e1e T 1 . In both cases V = V˜ = I2. We observe that the tokens, on the right column, converge towards a two-point clustering which is on affine hyperplans. 7. Representations Learned by LoRA In this subsection, we investigate how LoRA fine-tuning enables the emergence of new represent… view at source ↗

**Figure 5.** Figure 5: Accuracies of LoRA fine-tuned Transformers on [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Histogram of eigenvalues of AT = QT K in the 11th Head of model BERT XL. The y-axis is in log scale. The rank of A is roughly 400 meanwhile the embedding space is of dimension 2048. efforts should concentrate on the normalized dynamics of self-attention, as described by Geshkovski et al. (2023). 10. Acknowlegments The authors thank Cyril Letrouit and Borjan Geshkovski for valuable discussions at an early s… view at source ↗

**Figure 7.** Figure 7: Loss curves and accuracies curves for the training of the Transformers model on modular addition with different [PITH_FULL_IMAGE:figures/full_fig_p037_7.png] view at source ↗

**Figure 8.** Figure 8: Spectrum of the learned values matrix at the end of training of the Transformers architecture that we implemented [PITH_FULL_IMAGE:figures/full_fig_p037_8.png] view at source ↗

**Figure 9.** Figure 9: We display the spectrum of the values matrices LoRA i.e. the spectrum of [PITH_FULL_IMAGE:figures/full_fig_p038_9.png] view at source ↗

**Figure 10.** Figure 10: We display the accuracies of the LoRA Transformers on modular addition and modular subtraction. [PITH_FULL_IMAGE:figures/full_fig_p038_10.png] view at source ↗

**Figure 11.** Figure 11: Histograms of the eigenvalues of matrix Ah = QT h Kh where h denotes the number of attention head in a Albert XL model. The y-axis is in log-scale. The number of eigenvalues associated with the 0 eigenvalue is very large (roughly 200 ) which implies that the matrix is low-rank. 1 2 3 4 Singular Value 0 50 100 150 200 250 300 350 Frequency (a) Layer 0 1 2 3 4 Singular Value 0 20 40 60 80 100 120 Frequency … view at source ↗

**Figure 12.** Figure 12: The SVD values show an interesting behavior. We can see for all blocks that there are outliers, and the distribution [PITH_FULL_IMAGE:figures/full_fig_p040_12.png] view at source ↗

**Figure 13.** Figure 13: Spectrum of different values Vh and V˜ h matrix parameters for different layers h. The spectrum of both V and V˜ seems similar to a random matrix of the Complex Ginibre Ensemble (random matrix with i.i.d Gaussian variables at each index). The spectrum of the early layers is less localized than the layers of the late layers in the sense that it seems that there are more outliers in the first layers. G.3.4.… view at source ↗

**Figure 14.** Figure 14: Rank of KT Q matrices for each of the 32 attention blocks of LLama-2-7b developed in this paper does not explain the behavior in Llama2, and we need different mathematical theories to explain success of LoRA in Llama2. 42 [PITH_FULL_IMAGE:figures/full_fig_p042_14.png] view at source ↗

**Figure 15.** Figure 15: For each layer, the eigenvalues learned by the LoRA algorithm i.e. the eigenvalues of [PITH_FULL_IMAGE:figures/full_fig_p043_15.png] view at source ↗

**Figure 16.** Figure 16: Average scalar product across tokens, for each layer in the augmented Llama 2 7b model, the vertical dashed line [PITH_FULL_IMAGE:figures/full_fig_p044_16.png] view at source ↗

**Figure 17.** Figure 17: Distribution of scalar products after passing through several layers of the model, for the dummy text example. [PITH_FULL_IMAGE:figures/full_fig_p044_17.png] view at source ↗

**Figure 18.** Figure 18: Distribution of scalar products after passing through several layers of the model, for the Wikipedia example. [PITH_FULL_IMAGE:figures/full_fig_p044_18.png] view at source ↗

read the original abstract

Low-Rank Adaptation (LoRA) is the dominant parameter-efficient fine-tuning method due to its favorable compute-performance trade-off, yet it suffers from catastrophic forgetting. We study forgetting through a tractable _mean-field self-attention_ toy model, where tokens evolve as an interacting particle system and LoRA acts as a low-rank perturbation. Using tools from partial differential equations and dynamical systems, we characterize regimes suggesting a phase transition between forgetting and non-forgetting behavior. We show that one phase transition appears with respect to the norm of the perturbation, and the other with respect to the depth of the Transformers. We further bound the time-to-deviation in terms of the perturbation size and spectral quantities, and corroborate the predicted trends with experiments and exploratory analyses on real models under LoRA fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The mean-field toy model yields explicit phase transitions and time bounds for LoRA forgetting, but the experiments only check trends and the link to real transformers stays tentative.

read the letter

The paper's core move is to replace the usual empirical study of LoRA forgetting with a mean-field self-attention model in which tokens act as particles and the LoRA update is a low-rank perturbation. From the resulting PDE they extract two phase transitions—one controlled by perturbation norm, one by depth—plus explicit bounds on time-to-deviation in terms of those quantities and spectral data. That is the new piece: a dynamical-systems characterization where earlier work mostly reported observations without such structure. The real-model runs are presented as corroborating the qualitative trends, which at least shows the toy model is not completely detached from practice. The derivations appear to use standard tools from interacting particle systems, so the formal steps are at least in a familiar lane. The soft spot is exactly the one the stress-test flags. All the sharp claims live inside the continuous approximation; the experiments only match directions, not the predicted thresholds or the time bounds. If finite-width effects or the precise coupling of LoRA to the attention matrix matter more than the model allows, the phase transitions become artifacts rather than guides for hyper-parameter choice. The abstract does not claim stronger validation, so the gap is real but not hidden. This is for readers who already work with mean-field limits or dynamical systems in attention and want to see them applied to adaptation. It is coherent on its own terms and shows honest engagement with the problem, so it deserves referee time even though the practical payoff will need more direct testing of the transition points.

Referee Report

3 major / 2 minor

Summary. The paper introduces a mean-field self-attention toy model in which tokens evolve as an interacting particle system and LoRA fine-tuning is represented as a low-rank perturbation. Using PDE and dynamical-systems tools, the authors derive regimes indicating phase transitions in forgetting behavior—one with respect to the norm of the perturbation and one with respect to transformer depth—along with bounds on time-to-deviation expressed in terms of perturbation size and spectral quantities. These predictions are said to be corroborated by experiments and exploratory analyses on real transformer models under LoRA fine-tuning.

Significance. If the central claims hold, the work supplies a tractable dynamical-systems framework for analyzing catastrophic forgetting under LoRA, including explicit phase-transition thresholds and time-to-deviation bounds. The explicit use of mean-field limits and spectral quantities to obtain parameter-free regime characterizations is a methodological strength that could guide future mitigation strategies. The provision of real-model experiments that test predicted trends adds empirical grounding inside the manuscript's scope.

major comments (3)

[§3 (model definition) and §5 (experiments)] The central claims rest on the mean-field particle-system approximation reproducing the dominant attention dynamics that drive forgetting. No section supplies a quantitative comparison (e.g., evolution of the attention matrix or token-interaction statistics) between the continuous toy model and the finite-width, discrete-token behavior of the real transformers used in the experiments; without such a check, the identified transitions remain internal to the toy model.
[§5] §5 states that real-model experiments 'corroborate the predicted trends.' This is weaker than a direct test of the derived transition thresholds (critical perturbation norm or depth value). The manuscript therefore does not establish that the phase boundaries obtained from the PDE analysis are predictive rather than artifacts of the continuous approximation.
[§4 (dynamical analysis)] The time-to-deviation bound is expressed in terms of spectral quantities of the attention operator. The manuscript does not characterize how these spectral quantities themselves evolve under the low-rank LoRA update inside the mean-field dynamics, leaving the practical utility of the bound dependent on an unstated assumption of spectral stability.

minor comments (2)

[§2–3] Notation for the mean-field measure and the low-rank perturbation operator should be introduced with a single consolidated table to avoid repeated re-definition across sections.
[§5] The experimental section would benefit from an explicit statement of the data-exclusion criteria and the precise definition of 'forgetting' used to generate the reported curves.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below with clarifications on the manuscript's scope and indicate planned revisions where we agree changes are warranted.

read point-by-point responses

Referee: [§3 (model definition) and §5 (experiments)] The central claims rest on the mean-field particle-system approximation reproducing the dominant attention dynamics that drive forgetting. No section supplies a quantitative comparison (e.g., evolution of the attention matrix or token-interaction statistics) between the continuous toy model and the finite-width, discrete-token behavior of the real transformers used in the experiments; without such a check, the identified transitions remain internal to the toy model.

Authors: We agree that a quantitative comparison of attention dynamics would strengthen the link between the toy model and real transformers. In the revised manuscript we will add a controlled comparison of attention matrix evolution and token-interaction statistics on a small-scale real transformer setup. revision: yes
Referee: [§5] §5 states that real-model experiments 'corroborate the predicted trends.' This is weaker than a direct test of the derived transition thresholds (critical perturbation norm or depth value). The manuscript therefore does not establish that the phase boundaries obtained from the PDE analysis are predictive rather than artifacts of the continuous approximation.

Authors: The manuscript's stated scope is to identify phase-transition regimes inside the mean-field model and to show that real LoRA fine-tuning exhibits qualitatively matching trends. We do not claim the exact numerical thresholds are directly predictive on real models. We will revise §5 to clarify this distinction explicitly. revision: partial
Referee: [§4 (dynamical analysis)] The time-to-deviation bound is expressed in terms of spectral quantities of the attention operator. The manuscript does not characterize how these spectral quantities themselves evolve under the low-rank LoRA update inside the mean-field dynamics, leaving the practical utility of the bound dependent on an unstated assumption of spectral stability.

Authors: The bound is stated in terms of the initial spectral quantities of the attention operator. We will add a remark in §4 clarifying the assumption of approximate spectral stability for small perturbations and noting the evolution of these quantities as an open question. revision: partial

Circularity Check

0 steps flagged

No circularity: derivation self-contained within explicitly defined toy model

full rationale

The paper constructs an explicit mean-field self-attention toy model (tokens as interacting particles, LoRA as low-rank perturbation) and applies standard PDE/dynamical-systems analysis to derive phase transitions and time-to-deviation bounds inside that model. These results are mathematical consequences of the stated assumptions and equations rather than reductions to fitted data, self-citations, or renamed empirical patterns. Real-model experiments are described only as corroborating trends, not as inputs that define or force the predictions. No load-bearing step reduces by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the mean-field approximation for self-attention and the modeling of LoRA as a low-rank perturbation; no free parameters or invented entities are identifiable from the abstract alone.

axioms (1)

domain assumption Tokens evolve as an interacting particle system under mean-field self-attention.
Invoked to justify reducing the attention mechanism to a tractable PDE/dynamical system.

pith-pipeline@v0.9.0 · 5669 in / 1355 out tokens · 66908 ms · 2026-05-24T03:30:06.952274+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Perceptrons and localization of attention's mean-field landscape
cs.LG 2026-01 unverdicted novelty 7.0

In the mean-field limit of attention with perceptron blocks, critical points of the energy landscape are generically atomic and localized on subsets of the unit sphere.
Quantitative Clustering in Mean-Field Transformer Models
cs.LG 2025-04 unverdicted novelty 5.0

Mean-field transformer models synchronize to a Dirac point mass exponentially fast with explicit quantitative rates under suitable parameter assumptions.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 2 Pith papers

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

Gradient flows: in metric spaces and in the space of probability measures

Ambrosio, L., Gigli, N., and Savar \'e , G. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005

work page 2005
[3]

I., Afrajmovich, V., Il'yashenko, Y

Arnold, V. I., Afrajmovich, V., Il'yashenko, Y. S., and Shil'nikov, L. Dynamical systems V: bifurcation theory and catastrophe theory, volume 5. Springer Science & Business Media, 2013

work page 2013
[4]

Transformers learn through gradual rank increase

Boix-Adsera, E., Littwin, E., Abbe, E., Bengio, S., and Susskind, J. Transformers learn through gradual rank increase. Advances in neural information processing systems, 36, 2024

work page 2024
[5]

Concentration Inequalities: A Nonasymptotic Theory of Independence

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford Press University, 2013

work page 2013
[6]

Understanding the regularity of self-attention with optimal transport, 2023

Castin, V., Ablin, P., and Peyré, G. Understanding the regularity of self-attention with optimal transport, 2023

work page 2023
[7]

Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D

Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa - Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018,...

work page 2018
[8]

Stochastic deep networks

De Bie, G., Peyr \'e , G., and Cuturi, M. Stochastic deep networks. In International Conference on Machine Learning, pp.\ 1556--1565. PMLR, 2019

work page 2019
[9]

QL o RA : Efficient finetuning of quantized LLM s

Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. QL o RA : Efficient finetuning of quantized LLM s. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

work page 2023
[10]

Attention is not all you need: Pure attention loses rank doubly exponentially with depth

Dong, Y., Cordonnier, J.-B., and Loukas, A. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In International Conference on Machine Learning, pp.\ 2793--2803. PMLR, 2021

work page 2021
[11]

A mathematical perspective on transformers, 2023

G eshkovski, B., L etrouit, C., P olyanskiy, Y., and R igollet, P. A mathematical perspective on transformers, 2023

work page 2023
[12]

The emergence of clusters in self-attention dynamics

Geshkovski, B., Letrouit, C., Polyanskiy, Y., and Rigollet, P. The emergence of clusters in self-attention dynamics. Advances in neural information processing systems, 36, 2024

work page 2024
[13]

Stable architectures for deep neural networks

Haber, E. and Ruthotto, L. Stable architectures for deep neural networks. Inverse Problems, 34 0 (1): 0 014004, dec 2017. ISSN 1361-6420. doi:10.1088/1361-6420/aa9a90

work page doi:10.1088/1361-6420/aa9a90 2017
[14]

Deep residual learning for image recognition

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 770--778, 2016

work page 2016
[15]

J., Shen, Y., Wallis, P., Allen - Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W

Hu, E. J., Shen, Y., Wallis, P., Allen - Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 . OpenReview.net, 2022

work page 2022
[16]

and Motsch, S

Jabin, P.-E. and Motsch, S. Clustering and asymptotic behavior in opinion formation. Journal of Differential Equations, 257 0 (11): 0 4165--4187, 2014. ISSN 0022-0396. doi:https://doi.org/10.1016/j.jde.2014.08.005

work page doi:10.1016/j.jde.2014.08.005 2014
[17]

Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7b, 2023

work page 2023
[18]

The lipschitz constant of self-attention

Kim, H., Papamakarios, G., and Mnih, A. The lipschitz constant of self-attention. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event , volume 139 of Proceedings of Machine Learning Research, pp.\ 5562--5571. PMLR , 2021

work page 2021
[19]

Self-entrainment of a population of coupled non-linear oscillators

Kuramoto, Y. Self-entrainment of a population of coupled non-linear oscillators. In Araki, H. (ed.), International Symposium on Mathematical Problems in Theoretical Physics, pp.\ 420--422, Berlin, Heidelberg, 1975. Springer Berlin Heidelberg. ISBN 978-3-540-37509-8

work page 1975
[20]

A discrete nonlinear and non-autonomous model of consensus formation

Kurause, U. A discrete nonlinear and non-autonomous model of consensus formation. Communications in Difference Equations, 2000

work page 2000
[21]

Understanding and improving transformer from a multi-particle dynamic system point of view, 2019

Lu, Y., Li, Z., He, D., Sun, Z., Dong, B., Qin, T., Wang, L., and Liu, T.-Y. Understanding and improving transformer from a multi-particle dynamic system point of view, 2019

work page 2019
[22]

Peft: State-of-the-art parameter-efficient fine-tuning methods

Mangrulkar, S., Gugger, S., Debut, L., Belkada, Y., Paul, S., and Bossan, B. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022

work page 2022
[23]

Effects of parameter norm growth during transformer training: Inductive bias from gradient descent, 2023

Merrill, W., Ramanujan, V., Goldberg, Y., Schwartz, R., and Smith, N. Effects of parameter norm growth during transformer training: Inductive bias from gradient descent, 2023

work page 2023
[24]

and Tadmor, E

Motsch, S. and Tadmor, E. Heterophilious dynamics enhances consensus. 10.100 Review, 56 0 (4): 0 577--621, 2014. ISSN 00361445, 10957200

work page 2014
[25]

P., and Lucchi, A

Noci, L., Anagnostidis, S., Biggio, L., Orvieto, A., Singh, S. P., and Lucchi, A. Signal propagation in transformers: Theoretical perspectives and the role of rank collapse. Advances in Neural Information Processing Systems, 35: 0 27198--27211, 2022

work page 2022
[26]

Computational optimal transport: With applications to data science

Peyr \'e , G., Cuturi, M., et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning , 11 0 (5-6): 0 355--607, 2019

work page 2019
[27]

and Rossi, F

Piccoli, B. and Rossi, F. Transport equation with nonlocal velocity in wasserstein spaces: Convergence of numerical schemes. 06 2011

work page 2011
[28]

E., Ablin, P., Blondel, M., and Peyr \'e , G

Sander, M. E., Ablin, P., Blondel, M., and Peyr \'e , G. Sinkformers: Transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics, pp.\ 3515--3530. PMLR, 2022

work page 2022
[29]

Optimal transport for applied mathematicians

Santambrogio, F. Optimal transport for applied mathematicians. Progress in Nonlinear Differential Equations and Their Applications, 2015

work page 2015
[30]

Swarming: hydrodynamic alignment with pressure

Tadmor, E. Swarming: hydrodynamic alignment with pressure. Bulletin of the American Mathematical Society, pp.\ 285–325, 2023

work page 2023
[31]

A., Li, Y., Thrampoulidis, C., and Oymak, S

Tarzanagh, D. A., Li, Y., Thrampoulidis, C., and Oymak, S. Transformers as support vector machines. In NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning, 2023

work page 2023
[32]

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa...

work page 2023
[33]

N., Kaiser, ., and Polosukhin, I

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, ., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017

work page 2017
[34]

Optimal transport: Old and new

Villani, C. Optimal transport: Old and new. Springer, 2008

work page 2008
[35]

Vuckovic, J., Baratin, A., and des Combes, R. T. A mathematical theory of attention, 2020

work page 2020
[36]

Vuckovic, J., Baratin, A., and des Combes, R. T. On the regularity of attention, 2021

work page 2021
[37]

Orthogonal subspace learning for language model continual learning, 2023

Wang, X., Chen, T., Ge, Q., Xia, H., Bao, R., Zheng, R., Zhang, Q., Gui, T., and Huang, X. Orthogonal subspace learning for language model continual learning, 2023

work page 2023
[38]

A proposal on machine learning via dynamical systems

Weinan, E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 1 0 (5): 0 1--11, 2017

work page 2017
[39]

and Bruna, J

Zweig, A. and Bruna, J. A functional perspective on learning symmetric functions with neural networks. In International Conference on Machine Learning, pp.\ 13023--13032. PMLR, 2021

work page 2021

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[2] [2]

Gradient flows: in metric spaces and in the space of probability measures

Ambrosio, L., Gigli, N., and Savar \'e , G. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005

work page 2005

[3] [3]

I., Afrajmovich, V., Il'yashenko, Y

Arnold, V. I., Afrajmovich, V., Il'yashenko, Y. S., and Shil'nikov, L. Dynamical systems V: bifurcation theory and catastrophe theory, volume 5. Springer Science & Business Media, 2013

work page 2013

[4] [4]

Transformers learn through gradual rank increase

Boix-Adsera, E., Littwin, E., Abbe, E., Bengio, S., and Susskind, J. Transformers learn through gradual rank increase. Advances in neural information processing systems, 36, 2024

work page 2024

[5] [5]

Concentration Inequalities: A Nonasymptotic Theory of Independence

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford Press University, 2013

work page 2013

[6] [6]

Understanding the regularity of self-attention with optimal transport, 2023

Castin, V., Ablin, P., and Peyré, G. Understanding the regularity of self-attention with optimal transport, 2023

work page 2023

[7] [7]

Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D

Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa - Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018,...

work page 2018

[8] [8]

Stochastic deep networks

De Bie, G., Peyr \'e , G., and Cuturi, M. Stochastic deep networks. In International Conference on Machine Learning, pp.\ 1556--1565. PMLR, 2019

work page 2019

[9] [9]

QL o RA : Efficient finetuning of quantized LLM s

Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. QL o RA : Efficient finetuning of quantized LLM s. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

work page 2023

[10] [10]

Attention is not all you need: Pure attention loses rank doubly exponentially with depth

Dong, Y., Cordonnier, J.-B., and Loukas, A. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In International Conference on Machine Learning, pp.\ 2793--2803. PMLR, 2021

work page 2021

[11] [11]

A mathematical perspective on transformers, 2023

G eshkovski, B., L etrouit, C., P olyanskiy, Y., and R igollet, P. A mathematical perspective on transformers, 2023

work page 2023

[12] [12]

The emergence of clusters in self-attention dynamics

Geshkovski, B., Letrouit, C., Polyanskiy, Y., and Rigollet, P. The emergence of clusters in self-attention dynamics. Advances in neural information processing systems, 36, 2024

work page 2024

[13] [13]

Stable architectures for deep neural networks

Haber, E. and Ruthotto, L. Stable architectures for deep neural networks. Inverse Problems, 34 0 (1): 0 014004, dec 2017. ISSN 1361-6420. doi:10.1088/1361-6420/aa9a90

work page doi:10.1088/1361-6420/aa9a90 2017

[14] [14]

Deep residual learning for image recognition

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 770--778, 2016

work page 2016

[15] [15]

J., Shen, Y., Wallis, P., Allen - Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W

Hu, E. J., Shen, Y., Wallis, P., Allen - Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 . OpenReview.net, 2022

work page 2022

[16] [16]

and Motsch, S

Jabin, P.-E. and Motsch, S. Clustering and asymptotic behavior in opinion formation. Journal of Differential Equations, 257 0 (11): 0 4165--4187, 2014. ISSN 0022-0396. doi:https://doi.org/10.1016/j.jde.2014.08.005

work page doi:10.1016/j.jde.2014.08.005 2014

[17] [17]

Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7b, 2023

work page 2023

[18] [18]

The lipschitz constant of self-attention

Kim, H., Papamakarios, G., and Mnih, A. The lipschitz constant of self-attention. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event , volume 139 of Proceedings of Machine Learning Research, pp.\ 5562--5571. PMLR , 2021

work page 2021

[19] [19]

Self-entrainment of a population of coupled non-linear oscillators

Kuramoto, Y. Self-entrainment of a population of coupled non-linear oscillators. In Araki, H. (ed.), International Symposium on Mathematical Problems in Theoretical Physics, pp.\ 420--422, Berlin, Heidelberg, 1975. Springer Berlin Heidelberg. ISBN 978-3-540-37509-8

work page 1975

[20] [20]

A discrete nonlinear and non-autonomous model of consensus formation

Kurause, U. A discrete nonlinear and non-autonomous model of consensus formation. Communications in Difference Equations, 2000

work page 2000

[21] [21]

Understanding and improving transformer from a multi-particle dynamic system point of view, 2019

Lu, Y., Li, Z., He, D., Sun, Z., Dong, B., Qin, T., Wang, L., and Liu, T.-Y. Understanding and improving transformer from a multi-particle dynamic system point of view, 2019

work page 2019

[22] [22]

Peft: State-of-the-art parameter-efficient fine-tuning methods

Mangrulkar, S., Gugger, S., Debut, L., Belkada, Y., Paul, S., and Bossan, B. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022

work page 2022

[23] [23]

Effects of parameter norm growth during transformer training: Inductive bias from gradient descent, 2023

Merrill, W., Ramanujan, V., Goldberg, Y., Schwartz, R., and Smith, N. Effects of parameter norm growth during transformer training: Inductive bias from gradient descent, 2023

work page 2023

[24] [24]

and Tadmor, E

Motsch, S. and Tadmor, E. Heterophilious dynamics enhances consensus. 10.100 Review, 56 0 (4): 0 577--621, 2014. ISSN 00361445, 10957200

work page 2014

[25] [25]

P., and Lucchi, A

Noci, L., Anagnostidis, S., Biggio, L., Orvieto, A., Singh, S. P., and Lucchi, A. Signal propagation in transformers: Theoretical perspectives and the role of rank collapse. Advances in Neural Information Processing Systems, 35: 0 27198--27211, 2022

work page 2022

[26] [26]

Computational optimal transport: With applications to data science

Peyr \'e , G., Cuturi, M., et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning , 11 0 (5-6): 0 355--607, 2019

work page 2019

[27] [27]

and Rossi, F

Piccoli, B. and Rossi, F. Transport equation with nonlocal velocity in wasserstein spaces: Convergence of numerical schemes. 06 2011

work page 2011

[28] [28]

E., Ablin, P., Blondel, M., and Peyr \'e , G

Sander, M. E., Ablin, P., Blondel, M., and Peyr \'e , G. Sinkformers: Transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics, pp.\ 3515--3530. PMLR, 2022

work page 2022

[29] [29]

Optimal transport for applied mathematicians

Santambrogio, F. Optimal transport for applied mathematicians. Progress in Nonlinear Differential Equations and Their Applications, 2015

work page 2015

[30] [30]

Swarming: hydrodynamic alignment with pressure

Tadmor, E. Swarming: hydrodynamic alignment with pressure. Bulletin of the American Mathematical Society, pp.\ 285–325, 2023

work page 2023

[31] [31]

A., Li, Y., Thrampoulidis, C., and Oymak, S

Tarzanagh, D. A., Li, Y., Thrampoulidis, C., and Oymak, S. Transformers as support vector machines. In NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning, 2023

work page 2023

[32] [32]

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa...

work page 2023

[33] [33]

N., Kaiser, ., and Polosukhin, I

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, ., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017

work page 2017

[34] [34]

Optimal transport: Old and new

Villani, C. Optimal transport: Old and new. Springer, 2008

work page 2008

[35] [35]

Vuckovic, J., Baratin, A., and des Combes, R. T. A mathematical theory of attention, 2020

work page 2020

[36] [36]

Vuckovic, J., Baratin, A., and des Combes, R. T. On the regularity of attention, 2021

work page 2021

[37] [37]

Orthogonal subspace learning for language model continual learning, 2023

Wang, X., Chen, T., Ge, Q., Xia, H., Bao, R., Zheng, R., Zhang, Q., Gui, T., and Huang, X. Orthogonal subspace learning for language model continual learning, 2023

work page 2023

[38] [38]

A proposal on machine learning via dynamical systems

Weinan, E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 1 0 (5): 0 1--11, 2017

work page 2017

[39] [39]

and Bruna, J

Zweig, A. and Bruna, J. A functional perspective on learning symmetric functions with neural networks. In International Conference on Machine Learning, pp.\ 13023--13032. PMLR, 2021

work page 2021