Large-Step Training Dynamics of a Two-Factor Linear Transformer Model

Krishnakumar Balasubramanian

arXiv: 2605.21292 · v1 · pith:DZW2CLWLnew · submitted 2026-05-20 · stat.ML · cs.AI · cs.LG · math.DS

Pith reviewed 2026-05-21 03:46 UTC · model grok-4.3

classification: stat.ML, cs.AI, cs.LG, math.DS

keywords: linear transformers, training dynamics, large learning rates, gradient descent, chaotic dynamics, in-context learning, phase transitions, attractors

The pith

Large constant learning rates can shift linear transformer training from in-context regression to cycles, chaos or divergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the training dynamics of a simplified one-prompt linear transformer under gradient descent with large constant learning rates. After normalization, the dynamics reduce to a two-factor product map parameterized by an effective step size μ. Analysis of this map on the balanced slice and in the full two-dimensional system reveals transitions from convergence to periodic behavior, bounded chaos, and divergence as μ increases. This matters because it indicates that large learning rates do not simply accelerate training: in this model they can change the final learned behavior or prevent convergence altogether.

Core claim

What carries the argument

The two-factor product map obtained after normalization of the one-prompt linear-transformer training dynamics, which allows reduction to a scalar cubic map on the balanced slice and admits an explicit invariant Chebyshev ellipse in the full 2D system.
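The reduction to a scalar cubic on the balanced slice can be sketched from the map Φ_μ(a,b) = (a−(ab−μ)b, b−(ab−μ)a) quoted further down this page. The loss ℓ_μ(a,b) = ½(ab−μ)² below is an assumption: it is the simplest loss whose unit-step gradient descent reproduces that map, and the paper's own normalization may be written differently. The error variable e = ab − μ is likewise introduced here for illustration.

```latex
% Assumed normalized loss (hypothetical; chosen so that unit-step GD gives \Phi_\mu)
\ell_\mu(a,b) = \tfrac12\,(ab-\mu)^2, \qquad
\Phi_\mu(a,b) = (a,b) - \nabla\ell_\mu(a,b)
             = \bigl(a-(ab-\mu)\,b,\; b-(ab-\mu)\,a\bigr).

% Balanced slice a = b, with error e = a^2 - \mu
a^{+} = a\,(1+\mu-a^{2}) = a\,(1-e), \qquad
e^{+} = (a^{+})^{2}-\mu = (e+\mu)\,(1-e)^{2}-\mu .

% Linearization at the zero-error fixed point e = 0
\frac{\mathrm{d}e^{+}}{\mathrm{d}e}\Big|_{e=0} = 1-2\mu,
\qquad |1-2\mu|<1 \iff 0<\mu<1 .

% At \mu = 2 the error map is the cubic quoted on this page
e^{+} = (e+2)\,(1-e)^{2}-2 = e^{3}-3e .
```

Under these assumptions the zero-error fixed point loses stability at μ = 1, which agrees with one of the four thresholds listed in the Figure 2 caption, and the μ = 2 case recovers e⁺ = e³ − 3e.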

Load-bearing premise

After normalization, the one-prompt linear-transformer training dynamics reduce exactly to a two-factor product map with effective step size μ.

What would settle it

Running numerical simulations of the gradient descent updates for the linear transformer with effective step sizes μ just above and below the predicted stability thresholds, and checking whether the trajectories converge, enter periodic cycles, or diverge as forecast by the cubic map.
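A minimal sketch of such a check, run on the reduced map rather than on the transformer itself. It iterates Φ_μ(a,b) = (a−(ab−μ)b, b−(ab−μ)a) as quoted on this page and labels the long-run behaviour of the error ab − μ. The helper name `classify`, the burn-in length, the tolerances, and the escape radius are illustrative assumptions, not values taken from the paper.

```python
def phi(a, b, mu):
    """One step of the product map Phi_mu(a, b) quoted on this page."""
    e = a * b - mu                      # error of the product ab against mu
    return a - e * b, b - e * a


def classify(mu, a0, b0, burn_in=5000, window=32, max_period=8,
             tol=1e-8, escape=1e6):
    """Label the long-run behaviour of the error ab - mu (illustrative helper)."""
    a, b = a0, b0
    errs = []
    for step in range(burn_in + window):
        # Treat anything outside the escape radius (or NaN) as divergence.
        if not (abs(a) < escape and abs(b) < escape):
            return "diverged"
        if step >= burn_in:
            errs.append(a * b - mu)
        a, b = phi(a, b, mu)
    if max(abs(e) for e in errs) < tol:
        return "converged"
    # Smallest period p for which the recorded errors repeat.
    for p in range(1, max_period + 1):
        if all(abs(errs[i + p] - errs[i]) < tol for i in range(window - p)):
            return f"period-{p}"
    return "bounded, no short period found"


if __name__ == "__main__":
    # Step sizes on either side of the threshold mu = 1, plus one above mu = 2.
    for mu in (0.9, 1.1, 2.2):
        print(mu, classify(mu, 1.0, 1.0))
```

Checking the same thresholds on the full linear-transformer updates would additionally require the paper's normalization, which this page does not reproduce.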

Figures

Figures reproduced from arXiv: 2605.21292 by Krishnakumar Balasubramanian.

**Figure 2.** Balanced-line bifurcation diagram for Fµ. Each column is a single GD run on ℓµ from a balanced initial condition; points are the asymptotic error after burn-in. Vertical dashed lines are the four analytic thresholds 2√2 − 2 ≈ 0.83, 1, √5 − 1 ≈ 1.24, and 2. Divergent µ values are marked as red ticks at the bottom.
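The recipe in the Figure 2 caption (one run per µ from a balanced start, keep the errors visited after burn-in) can be sketched as follows. This assumes Fµ is the restriction of the quoted map Φµ to a = b, namely a ↦ a − (a² − µ)a; the starting point, burn-in length, and clustering tolerance are hypothetical choices, and no plotting is done.

```python
import math

# The four analytic thresholds listed in the caption.
THRESHOLDS = (2 * math.sqrt(2) - 2, 1.0, math.sqrt(5) - 1, 2.0)


def balanced_step(a, mu):
    """Phi_mu restricted to the balanced line a = b (assumed form of F_mu)."""
    return a - (a * a - mu) * a


def asymptotic_errors(mu, a0=0.5, burn_in=2000, keep=64, escape=1e6):
    """Errors a^2 - mu visited after burn-in, or None if the run diverges."""
    a = a0
    errs = []
    for step in range(burn_in + keep):
        if not abs(a) < escape:
            return None
        if step >= burn_in:
            errs.append(a * a - mu)
        a = balanced_step(a, mu)
    return errs


def count_distinct(values, tol=1e-6):
    """Number of clusters in values, merging neighbours closer than tol."""
    ordered = sorted(values)
    return 1 + sum(1 for x, y in zip(ordered, ordered[1:]) if y - x > tol)


def sweep(mu_values):
    """One column per mu: (mu, errors or None), following the caption's recipe."""
    return [(mu, asymptotic_errors(mu)) for mu in mu_values]


if __name__ == "__main__":
    for mu, errs in sweep([0.5, 0.9, 1.1, 1.5, 2.5]):
        label = "diverged" if errs is None else f"{count_distinct(errs)} value(s)"
        print(f"mu={mu}: {label}")
```

Plotting each column of `sweep` against µ, with the entries of `THRESHOLDS` as vertical lines, would give a diagram of the kind the caption describes.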

**Figure 3.** Phase portraits of Φµ. Four initial conditions per panel (black-bordered dots labelled A–D), two inside and two outside the Chebyshev ellipse Eµ. Arrowheads show iteration direction. Interior orbits converge to the zero-error hyperbola Mµ at µ = 0.7 (left) and to a period-two cycle at µ = 1.3 (right); exterior orbits diverge in both panels. Axes: a (horizontal), b (vertical).

**Figure 4.** A single mini-batch can cross the full-batch separatrix. Left: …

**Figure 5.** Transverse Lyapunov exponent of the balanced line under stochastic batch switching, …

**Figure 6.** Full-LSA multi-prompt mini-batch training. Left: population loss. Middle: instantaneous …

Original abstract

Gradient-flow analyses show that simplified linear transformers can learn the in-context linear-regression algorithm, but they do not explain the finite-step behavior of gradient descent at large learning rates. Motivated by empirical work on high-learning-rate transformer instabilities and by the cubic-map phase diagram for quadratic regression, we study an exactly reducible one-prompt linear-transformer training problem. After normalization, the dynamics reduce to a two-factor product map with an effective step-size parameter \(\mu\). On the balanced slice, this map recovers the known scalar cubic transition from monotone convergence to catapult convergence, periodic and chaotic bounded nonconvergence, and divergence. We then analyze the full two-dimensional system and show that, for \(0<\mu<2\), it has an explicit invariant Chebyshev ellipse separating forward-invariant regions; this ellipse carries off-balanced chaotic dynamics but is transversely repelling, while balanced scalar attractors can be transversely attracting. These results show that large constant learning rates can change the training attractor of the learned transformer rather than merely accelerating convergence: beyond sharp stability thresholds, finite-step training may settle into cycles, bounded chaos, or divergence instead of a single in-context linear-regression solution. We also discuss the consequences for mini-batch gradient descent based training methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reduces normalized one-prompt linear transformer training to an exact 2D product map, constructs an invariant Chebyshev ellipse that is transversely repelling, and shows large steps can replace the regression attractor with cycles or chaos.

The core result is that, after normalization, the one-prompt dynamics collapse exactly to a two-factor map with step-size parameter μ. On the balanced slice this recovers the cubic-map transitions from convergence to periodic, chaotic, or divergent behavior. In the full 2D system the author exhibits an explicit invariant Chebyshev ellipse that carries off-balanced chaos but repels transversely, while balanced scalar attractors can attract transversely. This is the new piece: a concrete 2D phase diagram that extends the scalar case and directly ties large constant learning rates to attractor switching rather than just faster convergence.

Referee Report

2 major / 2 minor

Summary. The paper analyzes the finite-step gradient descent dynamics of a simplified one-prompt linear transformer model. It claims that after a normalization step, the training dynamics exactly reduce to a two-factor product map parameterized by an effective step size μ. On the balanced slice, this map reproduces the phase diagram of the scalar cubic map, including transitions to periodic, chaotic, and divergent behaviors. For the full 2D system, an explicit invariant Chebyshev ellipse is identified for 0 < μ < 2, which is transversely repelling while balanced attractors can be attracting. The results imply that large constant learning rates can alter the training attractor away from the in-context linear regression solution towards cycles, bounded chaos, or divergence.

Significance. If the exact reduction holds, this provides a rigorous mathematical framework for understanding how large learning rates affect transformer training dynamics beyond the gradient-flow limit. It offers explicit stability thresholds and an invariant set analysis that could explain empirical instabilities in high-LR training. The explicit construction of the invariant ellipse and the transverse stability analysis are notable strengths, as is the connection to the known cubic map phenomenology without post-hoc fitting. This could inform the design of training schedules for transformers.

major comments (2)

[Abstract and model/normalization section] Abstract and the normalization step in the model section: the claim that the one-prompt linear-transformer gradient-descent updates reduce exactly to the two-factor product map with effective step-size μ is load-bearing for all stability thresholds and attractor conclusions, yet the algebraic cancellations that achieve this reduction after normalization are not shown explicitly; it is therefore unclear whether the reduction is exact for arbitrary prompt statistics or requires additional assumptions on data or initialization.
[Full 2D system analysis] The 2D system analysis section: the assertion of an explicit invariant Chebyshev ellipse for 0<μ<2 that separates forward-invariant regions and is transversely repelling requires the explicit verification that the ellipse is mapped into itself and the computation of the transverse Lyapunov exponent or linearization; without these steps the conclusion that off-balanced chaotic dynamics do not attract from the balanced slice does not follow.
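The invariance check requested above can be sketched numerically. The closed form used below is not quoted from the paper: it is a candidate ellipse derived here from the quoted map Φ_μ by writing u = a + b, v = a − b, which gives u⁺ = u(1 − e) and v⁺ = v(1 + e) with e = ab − μ, and then solving for an invariant curve of the form p·u² + q·v² = 1. That yields (a+b)²/(4(2+μ)) + (a−b)²/(4(2−μ)) = 1 for 0 < μ < 2. Treat the formula and all function names as assumptions to be compared against the paper's own definition of Eμ.

```python
import math


def phi(a, b, mu):
    """One step of the product map Phi_mu(a, b) quoted on this page."""
    e = a * b - mu
    return a - e * b, b - e * a


def ellipse_value(a, b, mu):
    """Candidate quadratic form; equals 1 on the candidate ellipse (assumption)."""
    return (a + b) ** 2 / (4 * (2 + mu)) + (a - b) ** 2 / (4 * (2 - mu))


def ellipse_point(theta, mu):
    """Point on the candidate ellipse, parametrized in u = a + b, v = a - b."""
    u = 2 * math.sqrt(2 + mu) * math.cos(theta)
    v = 2 * math.sqrt(2 - mu) * math.sin(theta)
    return (u + v) / 2, (u - v) / 2


def max_one_step_drift(mu, n=360):
    """Largest |ellipse_value - 1| after one step, over n points on the ellipse."""
    worst = 0.0
    for k in range(n):
        a, b = ellipse_point(2 * math.pi * k / n, mu)
        a1, b1 = phi(a, b, mu)
        worst = max(worst, abs(ellipse_value(a1, b1, mu) - 1.0))
    return worst


def max_cubic_mismatch(mu, n=360):
    """Largest gap between the next error and e^3 - 3e on the candidate ellipse."""
    worst = 0.0
    for k in range(n):
        a, b = ellipse_point(2 * math.pi * k / n, mu)
        e = a * b - mu
        a1, b1 = phi(a, b, mu)
        worst = max(worst, abs((a1 * b1 - mu) - (e ** 3 - 3 * e)))
    return worst


if __name__ == "__main__":
    for mu in (0.7, 1.3):
        print(mu, max_one_step_drift(mu), max_cubic_mismatch(mu))
```

Both quantities should be at the level of floating-point rounding if the candidate is right. A numerical check of this kind supports invariance but does not replace the algebraic verification the referee asks for, and it says nothing about transverse repulsion.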

minor comments (2)

[Notation and parameters] The relation between the original learning rate and the effective parameter μ should be written as a single displayed equation immediately after the normalization is introduced.
[Figures] Phase portraits or bifurcation diagrams for the 2D map would benefit from explicit annotation of the Chebyshev ellipse and the basins of attraction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our work. We address the major comments below and will incorporate the suggested clarifications in the revised manuscript.

Referee: [Abstract and model/normalization section] Abstract and the normalization step in the model section: the claim that the one-prompt linear-transformer gradient-descent updates reduce exactly to the two-factor product map with effective step-size μ is load-bearing for all stability thresholds and attractor conclusions, yet the algebraic cancellations that achieve this reduction after normalization are not shown explicitly; it is therefore unclear whether the reduction is exact for arbitrary prompt statistics or requires additional assumptions on data or initialization.

Authors: We appreciate this observation. Upon review, we recognize that while the reduction is derived in the manuscript, the intermediate algebraic steps were not presented in full detail. In the revised version, we will expand the model section to explicitly show the cancellations leading to the two-factor product map. This derivation holds for arbitrary prompt statistics under the normalization procedure described, without further assumptions on data or initialization beyond those stated in the paper.
revision: yes

Referee: [Full 2D system analysis] The 2D system analysis section: the assertion of an explicit invariant Chebyshev ellipse for 0<μ<2 that separates forward-invariant regions and is transversely repelling requires the explicit verification that the ellipse is mapped into itself and the computation of the transverse Lyapunov exponent or linearization; without these steps the conclusion that off-balanced chaotic dynamics do not attract from the balanced slice does not follow.

Authors: We agree that a more explicit verification is necessary to fully support the claims regarding the invariant ellipse. In the updated manuscript, we will provide the step-by-step verification that the Chebyshev ellipse is mapped into itself under the dynamics for 0 < μ < 2. Additionally, we will include the computation of the transverse linearization and the associated Lyapunov exponent to rigorously demonstrate that the ellipse is transversely repelling, while balanced attractors remain attracting in the transverse direction. This will strengthen the conclusion that off-balanced chaotic dynamics do not attract trajectories starting from the balanced slice.
revision: yes
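For the balanced attractors, the transverse linearization mentioned in this response has a simple form under the quoted map Φ_μ: the off-balance coordinate v = a − b satisfies v⁺ = v(1 + e) with e = ab − μ, so the transverse growth rate along a balanced orbit is the average of log|1 + eₙ|. The sketch below estimates that average. It is derived here from the map on this page rather than taken from the paper, it covers only the deterministic full-batch case, and the function names and iteration counts are illustrative assumptions.

```python
import math


def balanced_step(a, mu):
    """Phi_mu restricted to the balanced line a = b."""
    return a - (a * a - mu) * a


def transverse_exponent(mu, a0=0.5, burn_in=5000, n=2000, escape=1e6):
    """Average of log|1 + e_n| along a balanced orbit; None if it diverges."""
    a = a0
    for _ in range(burn_in):
        a = balanced_step(a, mu)
        if not abs(a) < escape:
            return None
    total = 0.0
    for _ in range(n):
        factor = abs(1.0 + (a * a - mu))   # |1 + e|: one-step growth of v = a - b
        if factor == 0.0:
            return float("-inf")           # v is annihilated in a single step
        total += math.log(factor)
        a = balanced_step(a, mu)
        if not abs(a) < escape:
            return None
    return total / n


if __name__ == "__main__":
    for mu in (0.9, 1.1, 1.5):
        print(mu, transverse_exponent(mu))
```

A negative value means small off-balance perturbations shrink on average, so the balanced attractor is transversely attracting. At the zero-error fixed point the value is zero, which reflects the neutral direction along the zero-error hyperbola. For the balanced period-two cycle of this map, the product of the two factors works out to 3 − 2μ, so the exponent there should be close to ½·log|3 − 2μ|.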

Circularity Check

0 steps flagged

Derivation chain is self-contained; reduction to product map follows from model equations without circular reduction

The paper starts from the one-prompt linear-transformer loss and gradient-descent updates, applies a stated normalization, and derives the two-factor product map with parameter μ directly from those equations. The balanced-slice recovery of the cubic map is presented as a known scalar case recovered by restriction, not as a fitted or self-defined prediction. The Chebyshev ellipse is constructed explicitly as an invariant set for the 2D map. No load-bearing self-citation, no post-hoc fitting renamed as prediction, and no ansatz smuggled in via prior work is required for the central stability thresholds or attractor-change claims. The derivation remains independent of its target conclusions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the exact reducibility of the transformer training dynamics to a two-factor product map after normalization and on the existence of an invariant Chebyshev ellipse whose transverse stability properties determine the attractors.

free parameters (1)

μ
Effective step-size parameter that controls the bifurcation structure of the reduced map.

axioms (1)

domain assumption: The one-prompt linear-transformer training problem is exactly reducible to a two-factor product map after normalization.
This reduction is invoked at the outset to enable all subsequent dynamical analysis.

pith-pipeline@v0.9.0 · 5753 in / 1398 out tokens · 63236 ms · 2026-05-21T03:46:19.481174+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: After normalization, the dynamics reduce to a two-factor product map with an effective step-size parameter μ. On the balanced slice, this map recovers the known scalar cubic transition... the full two-dimensional system... has an explicit invariant Chebyshev ellipse... e+ = C(e) = e³−3e

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery and embed_strictMono · unclear
Paper passage: the map Φ_μ(a,b)=(a−(ab−μ)b, b−(ab−μ)a)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

[1] A. Agarwala, F. Pedregosa, and J. Pennington. Second-order regression models exhibit progressive sharpening to the edge of stability. In International Conference on Machine Learning, pages 169–195. PMLR, 2023.

[2] S. Arora, Z. Li, and A. Panigrahi. Understanding gradient descent on the edge of stability in deep learning. In International Conference on Machine Learning, pages 948–1024. PMLR, 2022.

[3] L. Chen and J. Bruna. Beyond the edge of stability via two-step gradient updates. In International Conference on Machine Learning, pages 4330–4391. PMLR, 2023.

[4] X. Chen, K. Balasubramanian, P. Ghosal, and B. K. Agrawalla. From stability to chaos: Analyzing gradient descent dynamics in quadratic regression. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=Wiklo5VpG7

[5] J. Cohen, S. Kaur, Y. Li, J. Z. Kolter, and A. Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=jh-rTtvkGeM

[6] A. Damian, E. Nichani, and J. D. Lee. Self-stabilization: The implicit bias of gradient descent at the edge of stability. ICLR, 2023.

[7] R. L. Devaney. An Introduction to Chaotic Dynamical Systems. Westview Press, Boulder, CO, 2nd edition, 2003.

[8] J. Gilmer, B. Ghorbani, A. Garg, S. Kudugunta, B. Neyshabur, D. Cardoze, G. E. Dahl, Z. Nado, and O. Firat. A loss curvature perspective on training instabilities of deep learning models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=OcKMT-36vUs

[9] L. Herrmann, M. Granz, and T. Landgraf. Chaotic dynamics are intrinsic to neural network training with SGD. Advances in Neural Information Processing Systems, 35:5219–5229, 2022.

[10] M. Kodryan, E. Lobacheva, M. Nakhodnov, and D. Vetrov. Training scale-invariant neural networks on the sphere can happen in three regimes. Advances in Neural Information Processing Systems, 35:14058–14070, 2022.

[11] L. Kong and M. Tao. Stochasticity of deterministic gradient descent: Large learning rate for multiscale objective function. Advances in Neural Information Processing Systems, 33:2625–2638, 2020.

[12] A. Lewkowycz, Y. Bahri, E. Dyer, J. Sohl-Dickstein, and G. Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218, 2020.

[13] S. Liang and G. Montufar. Gradient descent with large step sizes: Chaos and fractal convergence region. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=wsxGCaBjWC

[14] E. Lobacheva, M. Kodryan, N. Chirkova, A. Malinin, and D. P. Vetrov. On the periodic behavior of neural network training with batch normalization and weight decay. Advances in Neural Information Processing Systems, 34:21545–21556, 2021.

[15] J. W. Milnor. Remarks on iterated cubic maps. Experimental Mathematics, 1(1):5–24, 1992. doi:10.1080/10586458.1992.10504242

[16] J. W. Milnor and C. Tresser. On entropy and monotonicity for real cubic maps. Communications in Mathematical Physics, 209(1):123–178, 2000. doi:10.1007/s002200050018. With an appendix by Adrien Douady and Pierrette Sentenac.

[17] M. Song and C. Yun. Trajectory alignment: Understanding the edge of stability phenomenon via bifurcation theory. In 37th Annual Conference on Neural Information Processing Systems, 2023.

[18] Y. Wang, M. Chen, T. Zhao, and M. Tao. Large learning rate tames homogeneity: Convergence and balancing effect. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=3tbDrs77LJ5

[19] S. Wiggins. Introduction to Applied Nonlinear Dynamical Systems and Chaos, volume 2 of Texts in Applied Mathematics. Springer, New York, 2nd edition, 2003.

[20] M. Wortsman, P. J. Liu, L. Xiao, K. E. Everett, A. A. Alemi, B. Adlam, J. D. Co-Reyes, I. Gur, A. Kumar, R. Novak, J. Pennington, J. Sohl-Dickstein, K. Xu, J. Lee, J. Gilmer, and S. Kornblith. Small-scale proxies for large-scale transformer training instabilities. In International Conference on Learning Representations (ICLR), 2024.

[21] J. Wu, P. L. Bartlett, M. Telgarsky, and B. Yu. Large stepsize gradient descent for logistic loss: Non-monotonicity of the loss improves optimization efficiency. In The Thirty Seventh Annual Conference on Learning Theory, pages 5019–5073. PMLR, 2024a.

[22] J. Wu, D. Zou, Z. Chen, V. Braverman, Q. Gu, and P. Bartlett. How many pretraining tasks are needed for in-context learning of linear regression? In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=vSh5ePa0ph

[23] J. Wu, P. Marion, and P. Bartlett. Large stepsizes accelerate gradient descent for regularized logistic regression. arXiv preprint arXiv:2506.02336, 2025.

[24] J. Zhang, H. Li, S. Sra, and A. Jadbabaie. Neural network weights do not converge to stationary points: An invariant measure perspective. In International Conference on Machine Learning, pages 26330–26346. PMLR, 2022.

[25] R. Zhang, S. Frei, and P. L. Bartlett. Trained transformers learn linear models in-context. Journal of Machine Learning Research, 25(49):1–55, 2024a.

[26] R. Zhang, J. Wu, and P. Bartlett. In-context learning of a linear transformer block: benefits of the MLP component and one-step GD initialization. Advances in Neural Information Processing Systems, 37:18310–18361, 2024b.

[27] R. Zhang, J. Wu, L. Lin, and P. L. Bartlett. Minimax optimal convergence of gradient descent in logistic regression via large and adaptive stepsizes. arXiv preprint arXiv:2504.04105, 2025.

[28] L. Zhu, C. Liu, A. Radhakrishnan, and M. Belkin. Quadratic models for understanding catapult dynamics of neural networks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=PvJnX3dwsD

[29] X. Zhu, Z. Wang, X. Wang, M. Zhou, and R. Ge. Understanding edge-of-stability training dynamics with a minimalist example. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=p7EagBsMAEO
