arXiv: 2605.21292 · v1 · pith:DZW2CLWLnew · submitted 2026-05-20 · 📊 stat.ML · cs.AI · cs.LG · math.DS

Large-Step Training Dynamics of a Two-Factor Linear Transformer Model

Pith reviewed 2026-05-21 03:46 UTC · model grok-4.3

classification 📊 stat.ML · cs.AI · cs.LG · math.DS
keywords linear transformers · training dynamics · large learning rates · gradient descent · chaotic dynamics · in-context learning · phase transitions · attractors

The pith

Large constant learning rates can shift linear transformer training from in-context regression to cycles, chaos or divergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the training dynamics of a simplified one-prompt linear transformer under gradient descent with large constant learning rates. After normalization, the dynamics reduce to a two-factor product map parameterized by an effective step size mu. Analysis of this map on the balanced slice and on the full two-dimensional system reveals transitions from convergence to periodic behavior, bounded chaos, and divergence as mu increases. This matters because large learning rates do not simply accelerate training: in this simplified transformer model they can change the final learned behavior or prevent convergence altogether.

Core claim

Gradient-flow analyses show that simplified linear transformers can learn the in-context linear-regression algorithm, but they do not explain the finite-step behavior of gradient descent at large learning rates. Motivated by empirical work on high-learning-rate transformer instabilities and by the cubic-map phase diagram for quadratic regression, we study an exactly reducible one-prompt linear-transformer training problem. After normalization, the dynamics reduce to a two-factor product map with an effective step-size parameter mu. On the balanced slice, this map recovers the known scalar cubic transition from monotone convergence to catapult convergence, periodic and chaotic bounded nonconvergence, and divergence.

What carries the argument

The two-factor product map obtained after normalization of the one-prompt linear-transformer training dynamics, which allows reduction to a scalar cubic map on the balanced slice and admits an explicit invariant Chebyshev ellipse in the full 2D system.

Load-bearing premise

After normalization, the one-prompt linear-transformer training dynamics reduce exactly to a two-factor product map with effective step-size mu.
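One concrete form consistent with this premise is sketched below. It is an assumption made for illustration, not the paper's stated normalization: the normalized loss is taken to be the two-factor product loss \(\ell(a,b)=\tfrac12(ab-1)^2\), the factor names \(a\) and \(b\) are hypothetical, and the balanced slice is taken to be \(a=b\). Gradient descent with step \(\mu\) on this loss gives

\[
\Phi_\mu(a,b) = \bigl(a - \mu\, b\,(ab-1),\; b - \mu\, a\,(ab-1)\bigr),
\qquad
F_\mu(a) = a - \mu\, a\,(a^2-1) \quad \text{on } a=b .
\]

Under this assumption the balanced fixed point \(a=1\) has multiplier \(F_\mu'(1) = 1-2\mu\), so it loses stability at \(\mu=1\). That is consistent with the value 1 appearing among the analytic thresholds in the Figure 2 caption, but the identification should be checked against the paper's own derivation.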

What would settle it

Running numerical simulations of the gradient descent updates for the linear transformer with effective step sizes mu just above and below the predicted stability thresholds and checking if the trajectories enter periodic cycles or diverge as forecasted by the cubic map.
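A minimal sketch of such a check on the balanced slice follows. It assumes the reduced scalar map has the cubic form F_mu(a) = a - mu * a * (a^2 - 1), which is a guess at the normalization and is not quoted from the paper; the function names, initial condition, and tolerances are all hypothetical.

```python
import math

def cubic_step(a: float, mu: float) -> float:
    # Assumed balanced-slice map F_mu(a) = a - mu * a * (a^2 - 1).
    return a - mu * a * (a * a - 1.0)

def classify(mu: float, a0: float = 0.5, burn_in: int = 5000,
             tail: int = 64, blowup: float = 1e6, tol: float = 1e-8) -> str:
    """Label one run as convergent, bounded nonconvergent, or divergent."""
    a = a0
    # Discard the transient, stopping early if the orbit escapes.
    for _ in range(burn_in):
        a = cubic_step(a, mu)
        if not math.isfinite(a) or abs(a) > blowup:
            return "divergent"
    # Record the error |a^2 - 1| over a short tail of iterates.
    errors = []
    for _ in range(tail):
        a = cubic_step(a, mu)
        if not math.isfinite(a) or abs(a) > blowup:
            return "divergent"
        errors.append(abs(a * a - 1.0))
    if max(errors) < tol:
        return "convergent"
    return "bounded nonconvergent"

if __name__ == "__main__":
    for mu in (0.7, 1.1, 2.5):
        print(mu, classify(mu))
```

Sweeping mu on a fine grid around each predicted threshold, and repeating with several initial conditions, would turn this into the test described above. The same check on the full gradient-descent updates of the linear transformer is what would actually confirm or refute the reduction.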

Figures

Figures reproduced from arXiv: 2605.21292 by Krishnakumar Balasubramanian.

Figure 1: Numerical verification of Proposition 2.1. For each …
Figure 2: Balanced-line bifurcation diagram for F_µ. Each column is a single GD run on ℓ_µ from a balanced initial condition; points are the asymptotic error after burn-in. Vertical dashed lines are the four analytic thresholds 2√2 − 2 ≈ 0.83, 1, √5 − 1 ≈ 1.24, and 2. Divergent µ values are marked as red ticks at the bottom.
Figure 3: Phase portraits of Φ_µ. Four initial conditions per panel (black-bordered dots labelled A–D), two inside and two outside the Chebyshev ellipse E_µ. Arrowheads show iteration direction. Interior orbits converge to the zero-error hyperbola M_µ at µ = 0.7 (left) and to a period-two cycle at µ = 1.3 (right); exterior orbits diverge in both panels.
[Plot residue: axes a and b; panel title "Trajectories from the full-batch s…"]
Figure 4: A single mini-batch can cross the full-batch separatrix. Left: …
Figure 5: Transverse Lyapunov exponent of the balanced line under stochastic batch switching, …
Figure 6: Full-LSA multi-prompt mini-batch training. Left: population loss. Middle: instantaneous …
Original abstract

Gradient-flow analyses show that simplified linear transformers can learn the in-context linear-regression algorithm, but they do not explain the finite-step behavior of gradient descent at large learning rates. Motivated by empirical work on high-learning-rate transformer instabilities and by the cubic-map phase diagram for quadratic regression, we study an exactly reducible one-prompt linear-transformer training problem. After normalization, the dynamics reduce to a two-factor product map with an effective step-size parameter \(\mu\). On the balanced slice, this map recovers the known scalar cubic transition from monotone convergence to catapult convergence, periodic and chaotic bounded nonconvergence, and divergence. We then analyze the full two-dimensional system and show that, for \(0<\mu<2\), it has an explicit invariant Chebyshev ellipse separating forward-invariant regions; this ellipse carries off-balanced chaotic dynamics but is transversely repelling, while balanced scalar attractors can be transversely attracting. These results show that large constant learning rates can change the training attractor of the learned transformer rather than merely accelerating convergence: beyond sharp stability thresholds, finite-step training may settle into cycles, bounded chaos, or divergence instead of a single in-context linear-regression solution. We also discuss the consequences for mini-batch gradient descent based training methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes the finite-step gradient descent dynamics of a simplified one-prompt linear transformer model. It claims that after a normalization step, the training dynamics exactly reduce to a two-factor product map parameterized by an effective step size μ. On the balanced slice, this map reproduces the phase diagram of the scalar cubic map, including transitions to periodic, chaotic, and divergent behaviors. For the full 2D system, an explicit invariant Chebyshev ellipse is identified for 0 < μ < 2, which is transversely repelling while balanced attractors can be attracting. The results imply that large constant learning rates can alter the training attractor away from the in-context linear regression solution towards cycles, bounded chaos, or divergence.

Significance. If the exact reduction holds, this provides a rigorous mathematical framework for understanding how large learning rates affect transformer training dynamics beyond the gradient-flow limit. It offers explicit stability thresholds and an invariant set analysis that could explain empirical instabilities in high-LR training. The explicit construction of the invariant ellipse and the transverse stability analysis are notable strengths, as is the connection to the known cubic map phenomenology without post-hoc fitting. This could inform the design of training schedules for transformers.

major comments (2)
  1. [Abstract and model/normalization section] Abstract and the normalization step in the model section: the claim that the one-prompt linear-transformer gradient-descent updates reduce exactly to the two-factor product map with effective step-size μ is load-bearing for all stability thresholds and attractor conclusions, yet the algebraic cancellations that achieve this reduction after normalization are not shown explicitly; it is therefore unclear whether the reduction is exact for arbitrary prompt statistics or requires additional assumptions on data or initialization.
  2. [Full 2D system analysis] The 2D system analysis section: the assertion of an explicit invariant Chebyshev ellipse for 0<μ<2 that separates forward-invariant regions and is transversely repelling requires the explicit verification that the ellipse is mapped into itself and the computation of the transverse Lyapunov exponent or linearization; without these steps the conclusion that off-balanced chaotic dynamics do not attract from the balanced slice does not follow.
minor comments (2)
  1. [Notation and parameters] The relation between the original learning rate and the effective parameter μ should be written as a single displayed equation immediately after the normalization is introduced.
  2. [Figures] Phase portraits or bifurcation diagrams for the 2D map would benefit from explicit annotation of the Chebyshev ellipse and the basins of attraction.
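The invariance check requested in major comment 2 can be sketched numerically. The sketch assumes the two-factor map is gradient descent on the loss 0.5 * (a*b - 1)^2. For that assumed map, a short calculation gives the candidate invariant curve a^2 + b^2 - mu*a*b = (4 - mu^2)/mu, which is an ellipse for 0 < mu < 2. Neither the map nor this equation is quoted from the paper, and all function names are hypothetical.

```python
import math

def phi(a: float, b: float, mu: float) -> tuple[float, float]:
    # One GD step on the assumed loss 0.5 * (a*b - 1)^2.
    e = a * b - 1.0
    return a - mu * b * e, b - mu * a * e

def ellipse_form(a: float, b: float, mu: float) -> float:
    # Quadratic form whose level set is the candidate invariant ellipse.
    return a * a + b * b - mu * a * b

def ellipse_level(mu: float) -> float:
    return (4.0 - mu * mu) / mu

def point_on_ellipse(mu: float, theta: float) -> tuple[float, float]:
    # Parametrise the level set through its principal axes a+b and a-b.
    r_sum = math.sqrt((2.0 + mu) / mu)
    r_diff = math.sqrt((2.0 - mu) / mu)
    c, s = math.cos(theta), math.sin(theta)
    return r_sum * c + r_diff * s, r_sum * c - r_diff * s

def max_invariance_residual(mu: float, n: int = 360) -> float:
    """Largest deviation from the level set after one step, over n samples."""
    level = ellipse_level(mu)
    worst = 0.0
    for k in range(n):
        a, b = point_on_ellipse(mu, 2.0 * math.pi * k / n)
        a1, b1 = phi(a, b, mu)
        worst = max(worst, abs(ellipse_form(a1, b1, mu) - level))
    return worst

if __name__ == "__main__":
    for mu in (0.7, 1.3):
        print(mu, max_invariance_residual(mu))
```

A residual at the level of floating-point round-off supports invariance for the sampled mu values only. It does not replace the algebraic verification the referee asks for, and it says nothing about transverse stability.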

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our work. We address the major comments below and will incorporate the suggested clarifications in the revised manuscript.

point-by-point responses
  1. Referee: [Abstract and model/normalization section] Abstract and the normalization step in the model section: the claim that the one-prompt linear-transformer gradient-descent updates reduce exactly to the two-factor product map with effective step-size μ is load-bearing for all stability thresholds and attractor conclusions, yet the algebraic cancellations that achieve this reduction after normalization are not shown explicitly; it is therefore unclear whether the reduction is exact for arbitrary prompt statistics or requires additional assumptions on data or initialization.

    Authors: We appreciate this observation. Upon review, we recognize that while the reduction is derived in the manuscript, the intermediate algebraic steps were not presented in full detail. In the revised version, we will expand the model section to explicitly show the cancellations leading to the two-factor product map. This derivation holds for arbitrary prompt statistics under the normalization procedure described, without further assumptions on data or initialization beyond those stated in the paper. revision: yes

  2. Referee: [Full 2D system analysis] The 2D system analysis section: the assertion of an explicit invariant Chebyshev ellipse for 0<μ<2 that separates forward-invariant regions and is transversely repelling requires the explicit verification that the ellipse is mapped into itself and the computation of the transverse Lyapunov exponent or linearization; without these steps the conclusion that off-balanced chaotic dynamics do not attract from the balanced slice does not follow.

    Authors: We agree that a more explicit verification is necessary to fully support the claims regarding the invariant ellipse. In the updated manuscript, we will provide the step-by-step verification that the Chebyshev ellipse is mapped into itself under the dynamics for 0 < μ < 2. Additionally, we will include the computation of the transverse linearization and the associated Lyapunov exponent to rigorously demonstrate that the ellipse is transversely repelling, while balanced attractors remain attracting in the transverse direction. This will strengthen the conclusion that off-balanced chaotic dynamics do not attract trajectories starting from the balanced slice. revision: yes
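The transverse computation promised in this response can be illustrated with a short numerical sketch. It assumes the two-factor map is gradient descent on 0.5 * (a*b - 1)^2, so that at a balanced point (a, a) the Jacobian has eigenvalue 1 + mu * (a^2 - 1) in the off-balanced direction (1, -1). The map, the function name, and the default parameters are hypothetical and are not taken from the paper.

```python
import math

def transverse_exponent(mu: float, a0: float = 0.5,
                        burn_in: int = 5000, n: int = 20000) -> float:
    """Average log transverse multiplier along a balanced orbit."""
    a = a0
    # Let the balanced orbit settle onto its attractor.
    for _ in range(burn_in):
        a = a - mu * a * (a * a - 1.0)
    total = 0.0
    for _ in range(n):
        # Multiplier in the direction (1, -1) at the balanced point (a, a).
        m = abs(1.0 + mu * (a * a - 1.0))
        total += math.log(max(m, 1e-300))  # floor avoids log(0)
        a = a - mu * a * (a * a - 1.0)
    return total / n

if __name__ == "__main__":
    for mu in (0.7, 1.1, 1.2):
        print(mu, transverse_exponent(mu))
```

A negative value means nearby off-balanced perturbations shrink on average, so the balanced attractor is transversely attracting. For the assumed map, the product of the two transverse multipliers on a balanced period-two cycle works out to 3 - 2*mu, which gives a closed-form value to compare against in that regime. The sketch is only meaningful for mu values where the balanced orbit stays bounded.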

Circularity Check

0 steps flagged

The derivation chain is self-contained: the reduction to the product map follows from the model equations without circular reasoning.

full rationale

The paper starts from the one-prompt linear-transformer loss and gradient-descent updates, applies a stated normalization, and derives the two-factor product map with parameter μ directly from those equations. The balanced-slice recovery of the cubic map is presented as a known scalar case recovered by restriction, not as a fitted or self-defined prediction. The Chebyshev ellipse is constructed explicitly as an invariant set for the 2D map. No load-bearing self-citation, no post-hoc fitting renamed as prediction, and no ansatz smuggled in via prior work is required for the central stability thresholds or attractor-change claims. The derivation remains independent of its target conclusions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the exact reducibility of the transformer training dynamics to a two-factor product map after normalization and on the existence of an invariant Chebyshev ellipse whose transverse stability properties determine the attractors.

free parameters (1)
  • mu
    Effective step-size parameter that controls the bifurcation structure of the reduced map.
axioms (1)
  • domain assumption: The one-prompt linear-transformer training problem is exactly reducible to a two-factor product map after normalization.
    This reduction is invoked at the outset to enable all subsequent dynamical analysis.

pith-pipeline@v0.9.0 · 5753 in / 1398 out tokens · 63236 ms · 2026-05-21T03:46:19.481174+00:00 · methodology


