Large constant learning rates in a two-factor linear transformer model can induce cycles, bounded chaos, or divergence rather than convergence to a single in-context linear-regression solution.
arXiv preprint arXiv:2506.02336
2 Pith papers cite this work. Polarity classification is still being indexed.
Fields: stat.ML (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers:
SGD on multiclass cross-entropy loss alternates between curvature-driven oscillations and stable regimes but self-stabilizes to enable best-iterate convergence with large learning rates for linear and two-layer models.
citing papers explorer
- Large-Step Training Dynamics of a Two-Factor Linear Transformer Model
  Large constant learning rates in a two-factor linear transformer model can induce cycles, bounded chaos, or divergence rather than convergence to a single in-context linear-regression solution.
- SGD at the Edge of Stability: Stochastic Stabilization with Large Learning Rates
  SGD on multiclass cross-entropy loss alternates between curvature-driven oscillations and stable regimes but self-stabilizes to enable best-iterate convergence with large learning rates for linear and two-layer models.