Large-Step Training Dynamics of a Two-Factor Linear Transformer Model
Pith reviewed 2026-05-21 03:46 UTC · model grok-4.3
The pith
Large constant learning rates can shift linear transformer training from in-context regression to cycles, chaos or divergence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gradient-flow analyses show that simplified linear transformers can learn the in-context linear-regression algorithm, but they do not explain the finite-step behavior of gradient descent at large learning rates. Motivated by empirical work on high-learning-rate transformer instabilities and by the cubic-map phase diagram for quadratic regression, we study an exactly reducible one-prompt linear-transformer training problem. After normalization, the dynamics reduce to a two-factor product map with an effective step-size parameter mu. On the balanced slice, this map recovers the known scalar cubic transition from monotone convergence to catapult convergence, periodic and chaotic bounded noncon
What carries the argument
The two-factor product map obtained after normalization of the one-prompt linear-transformer training dynamics, which allows reduction to a scalar cubic map on the balanced slice and admits an explicit invariant Chebyshev ellipse in the full 2D system.
Load-bearing premise
After normalization, the one-prompt linear-transformer training dynamics reduce exactly to a two-factor product map with effective step size μ.
What would settle it
Numerical simulation of the gradient-descent updates for the linear transformer at effective step sizes μ just below and just above each predicted stability threshold, checking whether the trajectories converge, enter periodic cycles, or diverge as the cubic map forecasts.
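The check described above can be sketched directly on the reduced map rather than on a full transformer. The sketch below iterates the two-factor map Φ_μ(a,b) = (a−(ab−μ)b, b−(ab−μ)a) quoted in the Lean-theorem excerpt further down this page and labels each run by its long-time behaviour. The function names, the blow-up cutoff, the tolerance, and the sample values of μ are illustrative assumptions, not values taken from the paper.

```python
import math

def phi(a, b, mu):
    """One step of the two-factor map quoted on this page; err = ab - mu drives both factors."""
    err = a * b - mu
    return a - err * b, b - err * a

def run(a, b, mu, steps=2000, blowup=1e6):
    """Iterate phi and record the error ab - mu; append inf and stop if the orbit blows up.

    The cutoff `blowup` is an arbitrary illustrative choice.
    """
    errs = []
    for _ in range(steps):
        a, b = phi(a, b, mu)
        if not (math.isfinite(a) and math.isfinite(b)) or max(abs(a), abs(b)) > blowup:
            errs.append(math.inf)
            break
        errs.append(a * b - mu)
    return errs

def classify(errs, tol=1e-8, tail=64):
    """Crude label: diverged, converged (error ~ 0 over the tail), or bounded without converging."""
    if math.isinf(errs[-1]):
        return "diverged"
    if max(abs(e) for e in errs[-tail:]) < tol:
        return "converged"
    return "bounded, not converged"

if __name__ == "__main__":
    # Balanced initialization a = b; the mu values are sample points, not the paper's thresholds.
    for mu in (0.3, 1.2, 2.5):
        print(mu, classify(run(0.1, 0.1, mu)))
```

Sweeping μ on a fine grid and starting from unbalanced pairs (a ≠ b) would extend the same sketch toward the full two-dimensional picture; it does not distinguish periodic from chaotic bounded orbits, which would need a Lyapunov-exponent or period-detection step.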
Original abstract
Gradient-flow analyses show that simplified linear transformers can learn the in-context linear-regression algorithm, but they do not explain the finite-step behavior of gradient descent at large learning rates. Motivated by empirical work on high-learning-rate transformer instabilities and by the cubic-map phase diagram for quadratic regression, we study an exactly reducible one-prompt linear-transformer training problem. After normalization, the dynamics reduce to a two-factor product map with an effective step-size parameter \(\mu\). On the balanced slice, this map recovers the known scalar cubic transition from monotone convergence to catapult convergence, periodic and chaotic bounded nonconvergence, and divergence. We then analyze the full two-dimensional system and show that, for \(0<\mu<2\), it has an explicit invariant Chebyshev ellipse separating forward-invariant regions; this ellipse carries off-balanced chaotic dynamics but is transversely repelling, while balanced scalar attractors can be transversely attracting. These results show that large constant learning rates can change the training attractor of the learned transformer rather than merely accelerating convergence: beyond sharp stability thresholds, finite-step training may settle into cycles, bounded chaos, or divergence instead of a single in-context linear-regression solution. We also discuss the consequences for mini-batch gradient descent based training methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes the finite-step gradient descent dynamics of a simplified one-prompt linear transformer model. It claims that after a normalization step, the training dynamics exactly reduce to a two-factor product map parameterized by an effective step size μ. On the balanced slice, this map reproduces the phase diagram of the scalar cubic map, including transitions to periodic, chaotic, and divergent behaviors. For the full 2D system, an explicit invariant Chebyshev ellipse is identified for 0 < μ < 2, which is transversely repelling while balanced attractors can be attracting. The results imply that large constant learning rates can alter the training attractor away from the in-context linear regression solution towards cycles, bounded chaos, or divergence.
Significance. If the exact reduction holds, this provides a rigorous mathematical framework for understanding how large learning rates affect transformer training dynamics beyond the gradient-flow limit. It offers explicit stability thresholds and an invariant set analysis that could explain empirical instabilities in high-LR training. The explicit construction of the invariant ellipse and the transverse stability analysis are notable strengths, as is the connection to the known cubic map phenomenology without post-hoc fitting. This could inform the design of training schedules for transformers.
major comments (2)
- [Abstract and model/normalization section] Abstract and the normalization step in the model section: the claim that the one-prompt linear-transformer gradient-descent updates reduce exactly to the two-factor product map with effective step-size μ is load-bearing for all stability thresholds and attractor conclusions, yet the algebraic cancellations that achieve this reduction after normalization are not shown explicitly; it is therefore unclear whether the reduction is exact for arbitrary prompt statistics or requires additional assumptions on data or initialization.
- [Full 2D system analysis] The 2D system analysis section: the assertion of an explicit invariant Chebyshev ellipse for 0<μ<2 that separates forward-invariant regions and is transversely repelling requires the explicit verification that the ellipse is mapped into itself and the computation of the transverse Lyapunov exponent or linearization; without these steps the conclusion that off-balanced chaotic dynamics do not attract from the balanced slice does not follow.
minor comments (2)
- [Notation and parameters] The relation between the original learning rate and the effective parameter μ should be written as a single displayed equation immediately after the normalization is introduced.
- [Figures] Phase portraits or bifurcation diagrams for the 2D map would benefit from explicit annotation of the Chebyshev ellipse and the basins of attraction.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our work. We address the major comments below and will incorporate the suggested clarifications in the revised manuscript.
- Referee: [Abstract and model/normalization section] Abstract and the normalization step in the model section: the claim that the one-prompt linear-transformer gradient-descent updates reduce exactly to the two-factor product map with effective step-size μ is load-bearing for all stability thresholds and attractor conclusions, yet the algebraic cancellations that achieve this reduction after normalization are not shown explicitly; it is therefore unclear whether the reduction is exact for arbitrary prompt statistics or requires additional assumptions on data or initialization.
  Authors: We appreciate this observation. Upon review, we recognize that while the reduction is derived in the manuscript, the intermediate algebraic steps were not presented in full detail. In the revised version, we will expand the model section to explicitly show the cancellations leading to the two-factor product map. This derivation holds for arbitrary prompt statistics under the normalization procedure described, without further assumptions on data or initialization beyond those stated in the paper.
  Revision: yes
- Referee: [Full 2D system analysis] The 2D system analysis section: the assertion of an explicit invariant Chebyshev ellipse for 0<μ<2 that separates forward-invariant regions and is transversely repelling requires the explicit verification that the ellipse is mapped into itself and the computation of the transverse Lyapunov exponent or linearization; without these steps the conclusion that off-balanced chaotic dynamics do not attract from the balanced slice does not follow.
  Authors: We agree that a more explicit verification is necessary to fully support the claims regarding the invariant ellipse. In the updated manuscript, we will provide the step-by-step verification that the Chebyshev ellipse is mapped into itself under the dynamics for 0 < μ < 2. Additionally, we will include the computation of the transverse linearization and the associated Lyapunov exponent to rigorously demonstrate that the ellipse is transversely repelling, while balanced attractors remain attracting in the transverse direction. This will strengthen the conclusion that off-balanced chaotic dynamics do not attract trajectories starting from the balanced slice.
  Revision: yes
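The invariance check discussed in this exchange can be sanity-tested numerically. This page does not reproduce the paper's equation for the ellipse, so the curve used below is an assumption: a candidate worked out here from the quoted map Φ_μ(a,b) = (a−(ab−μ)b, b−(ab−μ)a), namely (a+b)²/(4(2+μ)) + (a−b)²/(4(2−μ)) = 1 for 0 < μ < 2. It may or may not coincide with the paper's Chebyshev ellipse. The function names and the sampling density are likewise illustrative.

```python
import math

def phi(a, b, mu):
    """One step of the two-factor map quoted on this page."""
    err = a * b - mu
    return a - err * b, b - err * a

def ellipse_level(a, b, mu):
    """Candidate invariant (an assumption of this sketch): equals 1 on the conjectured ellipse.

    Only meaningful for 0 < mu < 2, where both denominators are positive.
    """
    s, d = a + b, a - b
    return s * s / (4 * (2 + mu)) + d * d / (4 * (2 - mu))

def point_on_ellipse(theta, mu):
    """Parametrize the candidate ellipse by an angle theta via sum/difference coordinates."""
    s = 2 * math.sqrt(2 + mu) * math.sin(theta)
    d = 2 * math.sqrt(2 - mu) * math.cos(theta)
    return (s + d) / 2, (s - d) / 2

def max_one_step_defect(mu, n=360):
    """Largest deviation of the level from 1 after ONE map step, over n sample points."""
    worst = 0.0
    for k in range(n):
        a, b = point_on_ellipse(2 * math.pi * k / n, mu)
        a2, b2 = phi(a, b, mu)
        worst = max(worst, abs(ellipse_level(a2, b2, mu) - 1.0))
    return worst

if __name__ == "__main__":
    for mu in (0.5, 1.0, 1.5):
        print(mu, max_one_step_defect(mu))
```

The sketch tests a single step on purpose: the abstract describes the ellipse as transversely repelling, so a long iterated orbit would be expected to drift off the curve through round-off alone. A defect at the level of floating-point error supports invariance of the candidate curve but is not a substitute for the algebraic verification the referee requests.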
Circularity Check
Derivation chain is self-contained; reduction to product map follows from model equations without circular reduction
full rationale
The paper starts from the one-prompt linear-transformer loss and gradient-descent updates, applies a stated normalization, and derives the two-factor product map with parameter μ directly from those equations. The balanced-slice recovery of the cubic map is presented as a known scalar case recovered by restriction, not as a fitted or self-defined prediction. The Chebyshev ellipse is constructed explicitly as an invariant set for the 2D map. No load-bearing self-citation, no post-hoc fitting renamed as prediction, and no ansatz smuggled in via prior work is required for the central stability thresholds or attractor-change claims. The derivation remains independent of its target conclusions.
Axiom & Free-Parameter Ledger
free parameters (1)
- μ (effective step-size parameter)
axioms (1)
- domain assumption: The one-prompt linear-transformer training problem is exactly reducible to a two-factor product map after normalization.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: After normalization, the dynamics reduce to a two-factor product map with an effective step-size parameter μ. On the balanced slice, this map recovers the known scalar cubic transition... the full two-dimensional system... has an explicit invariant Chebyshev ellipse... e+ = C(e) = e³−3e
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat recovery and embed_strictMono (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: the map Φ_μ(a,b)=(a−(ab−μ)b,b−(ab−μ)a)
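As a worked step, the map quoted in the passage above can be rewritten in sum and difference coordinates, which makes the balanced-slice reduction to a scalar cubic explicit. The symbols s, d, q and f_μ are notation introduced here for illustration and are not taken from the paper; the stability statement is only the local linearization at the balanced fixed points, not the paper's full phase diagram.

```latex
% Notation (s, d, q, f_\mu) is introduced here for illustration; it is not the paper's.
\begin{align*}
\Phi_\mu(a,b) &= \bigl(a - (ab-\mu)\,b,\; b - (ab-\mu)\,a\bigr), \qquad q := ab - \mu,\\
s := a + b:\quad s^{+} &= (1 - q)\,s,\\
d := a - b:\quad d^{+} &= (1 + q)\,d,\\
\text{balanced slice } a = b = x:\quad x^{+} &= f_\mu(x) = x\,(1 + \mu - x^{2}),\\
f_\mu'(\pm\sqrt{\mu}) &= 1 - 2\mu,
  \quad\text{so } \bigl|f_\mu'(\pm\sqrt{\mu})\bigr| < 1 \iff 0 < \mu < 1,\\
\left.\frac{\partial d^{+}}{\partial d}\right|_{d=0} &= 1 - \mu + x^{2}.
\end{align*}
```

At μ = 2 the balanced map becomes f_2(x) = 3x − x³ = −C(x) with C(e) = e³ − 3e, the cubic that appears in the first passage. The last line suggests one plausible route to the transverse stability of a balanced orbit (x_t): the time average of log|1 − μ + x_t²|, which is negative when the balanced attractor is transversely attracting. Whether the paper computes it this way is not stated on this page.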
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.