Understanding Catastrophic Forgetting In LoRA via Mean-Field Attention Dynamics
Pith reviewed 2026-05-24 03:30 UTC · model grok-4.3
The pith
A mean-field self-attention model identifies phase transitions that separate forgetting from non-forgetting regimes under LoRA fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the mean-field self-attention toy model, tokens evolve as an interacting particle system and LoRA acts as a low-rank perturbation; regimes exist that suggest a phase transition between forgetting and non-forgetting, one transition appearing with respect to the norm of the perturbation and the other with respect to transformer depth. The time-to-deviation is bounded in terms of perturbation size and spectral quantities, and the predicted trends match experiments on real models.
What carries the argument
The mean-field self-attention toy model, in which tokens evolve as an interacting particle system and LoRA is introduced as a low-rank perturbation.
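The mechanism can be illustrated with a finite-particle simulation. The sketch below is an assumption-laden illustration, not the paper's model: it uses a softmax-attention drift projected onto the unit sphere (a form common in the mean-field attention literature), identity query/key/value matrices as the "pretrained" weights, and a random rank-2 perturbation of the value matrix standing in for LoRA. The function names, the choice of perturbing only the value matrix, and all numerical settings are hypothetical.

```python
import numpy as np

def attention_velocity(X, Q, K, V, beta=1.0):
    """Softmax-attention drift for each token (row of X), projected onto the sphere's tangent space."""
    # scores[i, j] = beta * <Q x_i, K x_j>
    scores = beta * (X @ Q.T) @ (X @ K.T).T
    scores -= scores.max(axis=1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    drift = weights @ (X @ V.T)  # row i: sum_j w_ij * V x_j
    radial = np.sum(drift * X, axis=1, keepdims=True)
    return drift - radial * X  # remove the radial component

def simulate(X0, Q, K, V, beta=1.0, dt=0.05, steps=200):
    """Explicit Euler steps followed by renormalisation; `steps` plays the role of depth."""
    X = X0.copy()
    for _ in range(steps):
        X = X + dt * attention_velocity(X, Q, K, V, beta)
        X /= np.linalg.norm(X, axis=1, keepdims=True)
    return X

def low_rank_perturbation(d, rank, fro_norm, rng):
    """Random rank-`rank` matrix B @ A rescaled to a given Frobenius norm (hypothetical LoRA stand-in)."""
    delta = rng.standard_normal((d, rank)) @ rng.standard_normal((rank, d))
    return delta * (fro_norm / np.linalg.norm(delta))

rng = np.random.default_rng(0)
n, d = 32, 8
X0 = rng.standard_normal((n, d))
X0 /= np.linalg.norm(X0, axis=1, keepdims=True)
I = np.eye(d)

base = simulate(X0, I, I, I)  # unperturbed trajectory endpoint
for eps in (0.01, 0.1, 1.0):
    delta = low_rank_perturbation(d, rank=2, fro_norm=eps, rng=rng)
    tuned = simulate(X0, I, I, I + delta)  # perturb the value matrix only
    gap = np.linalg.norm(tuned - base, axis=1).mean()
    print(f"eps={eps}: mean token deviation = {gap:.4f}")
```

Comparing the deviation across `eps` and `steps` gives a rough feel for how perturbation norm and depth interact in such a system; it does not reproduce the paper's thresholds.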
Load-bearing premise
The tractable mean-field self-attention toy model with tokens as an interacting particle system and LoRA as a low-rank perturbation sufficiently captures the essential mechanisms of catastrophic forgetting in real transformer models under LoRA fine-tuning.
What would settle it
Measure forgetting curves on transformers of increasing depth while varying the Frobenius norm of the LoRA update; check whether the location of the observed transition in forgetting rate matches the phase boundary predicted by the mean-field analysis.
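A sweep of this kind needs a way to read off and control the Frobenius norm of the LoRA update. The sketch below assumes the standard LoRA parameterisation, update = (alpha / r) * B @ A, and computes its norm from the two small r x r Gram matrices rather than the full product. The helper names and the grid values are hypothetical, and the forgetting measurement itself is left out because the source does not specify it.

```python
import numpy as np

def lora_update_fro_norm(B, A, alpha, rank):
    """Frobenius norm of (alpha / rank) * B @ A without forming the d_out x d_in product."""
    # ||B A||_F^2 = trace((B^T B)(A A^T)), which equals the sum of the
    # elementwise product of the two symmetric r x r Gram matrices.
    gram = (B.T @ B) * (A @ A.T)
    return (alpha / rank) * np.sqrt(max(gram.sum(), 0.0))

def rescale_to_norm(B, A, alpha, rank, target):
    """Rescale B so that the LoRA update has Frobenius norm `target`."""
    current = lora_update_fro_norm(B, A, alpha, rank)
    return B * (target / current), A

def sweep_grid(depths, norms):
    """All (depth, update-norm) pairs at which forgetting would be measured."""
    return [(depth, eps) for depth in depths for eps in norms]

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 16, 12, 4, 8.0
B = rng.standard_normal((d_out, r))
A = rng.standard_normal((r, d_in))

for depth, eps in sweep_grid(depths=(4, 8, 16), norms=(0.1, 1.0)):
    B_eps, A_eps = rescale_to_norm(B, A, alpha, r, eps)
    achieved = lora_update_fro_norm(B_eps, A_eps, alpha, r)
    print(f"depth={depth}, target norm={eps}, achieved norm={achieved:.4f}")
```

In a real experiment each grid point would correspond to a separately fine-tuned model, with the forgetting metric recorded alongside the achieved norm.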
Original abstract
Low-Rank Adaptation (LoRA) is the dominant parameter-efficient fine-tuning method due to its favorable compute-performance trade-off, yet it suffers from catastrophic forgetting. We study forgetting through a tractable _mean-field self-attention_ toy model, where tokens evolve as an interacting particle system and LoRA acts as a low-rank perturbation. Using tools from partial differential equations and dynamical systems, we characterize regimes suggesting a phase transition between forgetting and non-forgetting behavior. We show that one phase transition appears with respect to the norm of the perturbation, and the other with respect to the depth of the Transformers. We further bound the time-to-deviation in terms of the perturbation size and spectral quantities, and corroborate the predicted trends with experiments and exploratory analyses on real models under LoRA fine-tuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a mean-field self-attention toy model in which tokens evolve as an interacting particle system and LoRA fine-tuning is represented as a low-rank perturbation. Using PDE and dynamical-systems tools, the authors derive regimes indicating phase transitions in forgetting behavior—one with respect to the norm of the perturbation and one with respect to transformer depth—along with bounds on time-to-deviation expressed in terms of perturbation size and spectral quantities. These predictions are said to be corroborated by experiments and exploratory analyses on real transformer models under LoRA fine-tuning.
Significance. If the central claims hold, the work supplies a tractable dynamical-systems framework for analyzing catastrophic forgetting under LoRA, including explicit phase-transition thresholds and time-to-deviation bounds. The explicit use of mean-field limits and spectral quantities to obtain parameter-free regime characterizations is a methodological strength that could guide future mitigation strategies. The real-model experiments that test the predicted trends add empirical grounding within the manuscript's scope.
major comments (3)
- [§3 (model definition) and §5 (experiments)] The central claims rest on the mean-field particle-system approximation reproducing the dominant attention dynamics that drive forgetting. No section supplies a quantitative comparison (e.g., evolution of the attention matrix or token-interaction statistics) between the continuous toy model and the finite-width, discrete-token behavior of the real transformers used in the experiments; without such a check, the identified transitions remain internal to the toy model.
- [§5] §5 states that real-model experiments 'corroborate the predicted trends.' This is weaker than a direct test of the derived transition thresholds (critical perturbation norm or depth value). The manuscript therefore does not establish that the phase boundaries obtained from the PDE analysis are predictive rather than artifacts of the continuous approximation.
- [§4 (dynamical analysis)] The time-to-deviation bound is expressed in terms of spectral quantities of the attention operator. The manuscript does not characterize how these spectral quantities themselves evolve under the low-rank LoRA update inside the mean-field dynamics, leaving the practical utility of the bound dependent on an unstated assumption of spectral stability.
minor comments (2)
- [§2–3] Notation for the mean-field measure and the low-rank perturbation operator should be introduced with a single consolidated table to avoid repeated re-definition across sections.
- [§5] The experimental section would benefit from an explicit statement of the data-exclusion criteria and the precise definition of 'forgetting' used to generate the reported curves.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major comment below with clarifications on the manuscript's scope and indicate planned revisions where we agree changes are warranted.
- Referee: [§3 (model definition) and §5 (experiments)] The central claims rest on the mean-field particle-system approximation reproducing the dominant attention dynamics that drive forgetting. No section supplies a quantitative comparison (e.g., evolution of the attention matrix or token-interaction statistics) between the continuous toy model and the finite-width, discrete-token behavior of the real transformers used in the experiments; without such a check, the identified transitions remain internal to the toy model.
  Authors: We agree that a quantitative comparison of attention dynamics would strengthen the link between the toy model and real transformers. In the revised manuscript we will add a controlled comparison of attention matrix evolution and token-interaction statistics on a small-scale real transformer setup.
  Revision: yes
- Referee: [§5] §5 states that real-model experiments 'corroborate the predicted trends.' This is weaker than a direct test of the derived transition thresholds (critical perturbation norm or depth value). The manuscript therefore does not establish that the phase boundaries obtained from the PDE analysis are predictive rather than artifacts of the continuous approximation.
  Authors: The manuscript's stated scope is to identify phase-transition regimes inside the mean-field model and to show that real LoRA fine-tuning exhibits qualitatively matching trends. We do not claim the exact numerical thresholds are directly predictive on real models. We will revise §5 to clarify this distinction explicitly.
  Revision: partial
- Referee: [§4 (dynamical analysis)] The time-to-deviation bound is expressed in terms of spectral quantities of the attention operator. The manuscript does not characterize how these spectral quantities themselves evolve under the low-rank LoRA update inside the mean-field dynamics, leaving the practical utility of the bound dependent on an unstated assumption of spectral stability.
  Authors: The bound is stated in terms of the initial spectral quantities of the attention operator. We will add a remark in §4 clarifying the assumption of approximate spectral stability for small perturbations and noting the evolution of these quantities as an open question.
  Revision: partial
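The spectral-stability assumption discussed in this exchange can be made concrete with a small numerical check. The sketch below is only an illustration under stated assumptions: it uses the singular values of a random weight matrix as a stand-in for the "spectral quantities of the attention operator" (which the source does not define), and compares their shift under a rank-2 update against the bound from Weyl's inequality for singular values, |σ_i(W + Δ) − σ_i(W)| ≤ ‖Δ‖_2. The function name and all settings are hypothetical.

```python
import numpy as np

def singular_value_shift(W, delta):
    """Largest change in any singular value of W under the update delta, and the Weyl bound."""
    s_base = np.linalg.svd(W, compute_uv=False)          # sorted in descending order
    s_pert = np.linalg.svd(W + delta, compute_uv=False)  # sorted in descending order
    max_shift = np.max(np.abs(s_pert - s_base))
    weyl_bound = np.linalg.norm(delta, 2)                # spectral norm of the update
    return max_shift, weyl_bound

rng = np.random.default_rng(0)
d, r = 16, 2
W = rng.standard_normal((d, d)) / np.sqrt(d)  # stand-in for a pretrained weight matrix

for eps in (0.01, 0.1, 1.0):
    delta = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
    delta *= eps / np.linalg.norm(delta)  # rescale to Frobenius norm eps
    shift, bound = singular_value_shift(W, delta)
    print(f"eps={eps}: max singular-value shift={shift:.4f}, Weyl bound={bound:.4f}")
```

This only covers a single static update; it says nothing about how the spectrum evolves along the mean-field dynamics, which the authors describe as an open question.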
Circularity Check
No circularity: derivation self-contained within explicitly defined toy model
The paper constructs an explicit mean-field self-attention toy model (tokens as interacting particles, LoRA as low-rank perturbation) and applies standard PDE/dynamical-systems analysis to derive phase transitions and time-to-deviation bounds inside that model. These results are mathematical consequences of the stated assumptions and equations rather than reductions to fitted data, self-citations, or renamed empirical patterns. Real-model experiments are described only as corroborating trends, not as inputs that define or force the predictions. No load-bearing step reduces by construction to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Tokens evolve as an interacting particle system under mean-field self-attention.
Forward citations
Cited by 2 Pith papers
- Perceptrons and localization of attention's mean-field landscape
  In the mean-field limit of attention with perceptron blocks, critical points of the energy landscape are generically atomic and localized on subsets of the unit sphere.
- Quantitative Clustering in Mean-Field Transformer Models
  Mean-field transformer models synchronize to a Dirac point mass exponentially fast with explicit quantitative rates under suitable parameter assumptions.