Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation

Alessio Giorlandino; Sebastian Goldt

arXiv: 2505.24333 · v3 · pith:I477SBLGnew · submitted 2025-05-30 · stat.ML · cond-mat.dis-nn · cond-mat.stat-mech · cs.LG

Pith reviewed 2026-05-22 02:38 UTC · model grok-4.3

classification stat.ML · cond-mat.dis-nn · cond-mat.stat-mech · cs.LG

keywords transformer initialization · self-attention · signal propagation · rank collapse · entropy collapse · Random Energy Model · trainability diagram

The pith

A unified theory maps self-attention to the Random Energy Model to derive exact initial scales that prevent both rank collapse and entropy collapse in deep transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes an analytical description of how signals move forward and gradients move backward through entire transformer stacks at the moment of initialization. It treats self-attention exactly by drawing a direct correspondence to the Random Energy Model, which supplies closed-form statistics for attention scores and their effect on token representations. From this, the authors obtain simple algorithms that draw trainability diagrams: regions in the space of weight and residual-connection scales where neither all tokens become identical nor attention concentrates on one token. The resulting prescriptions cover networks that also contain layer normalization, skip connections, and MLPs, and they predict the onset of vanishing gradients as well.
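
The two failure modes can be seen numerically in a single randomly initialised attention layer. The sketch below is a brute-force illustration, not the paper's analytical method: `attention_diagnostics`, the multiplier `beta`, and the 1/sqrt(d) weight scaling are assumptions made for this example, and `beta` here is just a scale on the logits that need not coincide with the paper's β.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_diagnostics(beta, T=256, d=128, seed=0):
    """Return (normalised attention entropy, mean output cosine similarity).

    ASSUMPTIONS (illustrative only): Gaussian input tokens, Gaussian
    query/key weights with entry variance 1/d, and a hypothetical scale
    `beta` multiplying the logits.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((T, d))                  # nearly orthogonal tokens
    WQ = rng.standard_normal((d, d)) / np.sqrt(d)
    WK = rng.standard_normal((d, d)) / np.sqrt(d)
    logits = (X @ WQ.T) @ (X @ WK.T).T / np.sqrt(d)  # entries of order one
    A = softmax(beta * logits)

    # Mean Shannon entropy of the attention rows, divided by its maximum log(T).
    entropy = -(A * np.log(A + 1e-30)).sum(axis=1).mean() / np.log(T)

    # Mean cosine similarity between distinct output tokens.
    Y = A @ X
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    C = Yn @ Yn.T
    mean_cos = (C.sum() - np.trace(C)) / (T * (T - 1))
    return float(entropy), float(mean_cos)

for beta in (0.01, 1.0, 20.0):
    ent, cos = attention_diagnostics(beta)
    print(f"beta={beta:5.2f}  entropy/log(T)={ent:.3f}  mean cosine={cos:.3f}")
```

With a very small scale the attention rows are close to uniform and the outputs become almost parallel, which is the rank-collapse picture. With a very large scale the rows are close to one-hot and the entropy drops towards zero, which is the entropy-collapse picture.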

Core claim

The central claim is that an exact treatment of the self-attention layer, obtained by mapping it onto the Random Energy Model, yields quantitative predictions for the variance of the initial weights and the strength of residual connections that keep both forward signals and backward gradients well-behaved in arbitrarily deep transformers.

What carries the argument

The formal parallel between the self-attention layer and the Random Energy Model, which supplies the exact distribution of attention scores needed to track signal propagation through the entire block.
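
To make the Random Energy Model picture concrete, the following sketch estimates the participation ratio Y2 = E[Σ_s A_ts²] of a softmax row by Monte Carlo. It assumes i.i.d. Gaussian logits with variance β² ln T, a parametrisation chosen for this example that may differ from the paper's. Under that assumption the textbook REM condensation point is β_c = √2, and the infinite-T value of Y2 is max(0, 1 − β_c/β). Finite-T estimates deviate noticeably from that limit, especially near the transition.

```python
import numpy as np

def participation_ratio(beta, T=4096, n_rows=200, seed=0):
    """Monte Carlo estimate of Y2 = E[sum_s A_ts^2] for REM-like logits."""
    rng = np.random.default_rng(seed)
    # ASSUMPTION: i.i.d. Gaussian logits with variance beta^2 * ln(T).
    logits = beta * np.sqrt(np.log(T)) * rng.standard_normal((n_rows, T))
    logits -= logits.max(axis=1, keepdims=True)   # stabilise the exponentials
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)             # softmax rows
    return float((A ** 2).sum(axis=1).mean())

BETA_C = np.sqrt(2.0)  # REM condensation point under the scaling assumed above

for beta in (0.5, 1.0, BETA_C, 2.0, 3.0):
    rem_limit = max(0.0, 1.0 - BETA_C / beta)
    y2 = participation_ratio(beta)
    print(f"beta={beta:.2f}  Y2 (Monte Carlo)={y2:.3f}  REM limit={rem_limit:.3f}")
```

A Y2 close to zero means attention is spread over many keys, while a Y2 of order one means a handful of keys carry almost all the weight.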

If this is right

Choosing initial weight variances inside the safe region of the trainability diagram eliminates rank collapse at initialization.
Tuning residual-connection scales inside the same region prevents entropy collapse and the associated training instability.
The same diagrams identify a regime in which gradients remain order-one rather than vanishing or exploding at the start of training.
The prescriptions apply uniformly to stacks containing layer normalization, skip connections, and MLPs.
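
The role of the residual scale can be explored with a direct simulation of a stack of attention layers. This is a simplified sketch under stated assumptions, not the paper's update map: it uses pre-layer-norm attention blocks without MLPs, fresh Gaussian weights in every layer, and a hypothetical residual strength `alpha_sa` applied to the attention branch.

```python
import numpy as np

def layer_norm(X, eps=1e-6):
    # Centre and rescale every token (row) independently.
    Xc = X - X.mean(axis=1, keepdims=True)
    return Xc / (Xc.std(axis=1, keepdims=True) + eps)

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def mean_cosine(X):
    # Average cosine similarity over all pairs of distinct tokens.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    C = Xn @ Xn.T
    T = X.shape[0]
    return float((C.sum() - np.trace(C)) / (T * (T - 1)))

def similarity_vs_depth(alpha_sa, beta, depth=30, T=64, d=64, seed=0):
    """Track the mean token cosine similarity through a stack of layers.

    ASSUMPTIONS: block is X <- X + alpha_sa * A(LN(X)) LN(X) W_V^T,
    with entry variance 1/d for all weights and a logit scale `beta`.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((T, d))
    history = [mean_cosine(X)]
    for _ in range(depth):
        H = layer_norm(X)
        WQ, WK, WV = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
        logits = (H @ WQ.T) @ (H @ WK.T).T / np.sqrt(d)
        A = softmax_rows(beta * logits)
        X = X + alpha_sa * (A @ H @ WV.T)   # residual update
        history.append(mean_cosine(X))
    return history

strong = similarity_vs_depth(alpha_sa=1.0, beta=0.1)
weak = similarity_vs_depth(alpha_sa=0.3, beta=0.1)
print(f"alpha_sa=1.0: final mean cosine = {strong[-1]:.3f}")
print(f"alpha_sa=0.3: final mean cosine = {weak[-1]:.3f}")
```

In this toy setting a smaller attention-branch scale slows the drift of tokens towards a common direction, which is the qualitative effect the trainability diagrams are meant to quantify.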

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same mapping technique could be used to derive initialization rules for attention variants such as linear attention or grouped attention.
The trainability diagrams might be extended to the case of fine-tuning by tracking how the statistics evolve once the network has moved away from the initial point.
Because the theory is asymptotic in width but exact in depth, it could guide the design of depth-specific initialization schedules rather than uniform scaling.

Load-bearing premise

The mapping of the attention mechanism onto the Random Energy Model must hold with sufficient accuracy for the derived statistics to remain valid.

What would settle it

Train a deep transformer with the weight and residual scales read off the predicted trainability diagram and check whether the attention matrix remains full-rank and the entropy of attention scores stays away from zero throughout the first few thousand steps, compared with a standard initialization that is known to collapse.
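
The two quantities named in this test can be computed from any attention matrix with a few lines of NumPy. The helper `attention_health` is a hypothetical name introduced for this sketch, and the choice of numerical rank plus mean row entropy is one possible pair of diagnostics rather than the paper's protocol.

```python
import numpy as np

def attention_health(A, eps=1e-30):
    """Return (numerical rank, mean Shannon entropy of the rows in nats)."""
    rank = int(np.linalg.matrix_rank(A))
    # Clip before the log so that exact zeros contribute 0 * log(eps) = 0.
    row_entropy = -(A * np.log(np.clip(A, eps, None))).sum(axis=1)
    return rank, float(row_entropy.mean())

T = 8
uniform = np.full((T, T), 1.0 / T)   # every query attends equally to all keys
one_hot = np.eye(T)                  # every query attends to a single key

print(attention_health(uniform))     # rank 1, entropy log(T)
print(attention_health(one_hot))     # rank T, entropy 0
```

The two extreme cases show why both numbers are needed: uniform attention has maximal entropy but rank one, while one-hot attention has full rank but zero entropy.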

Figures

Figures reproduced from arXiv: 2505.24333 by Alessio Giorlandino, Sebastian Goldt.

**Figure 1.** Two failure modes of Transformers at initialisation, and how to avoid them. (a) Rank collapse occurs when the self-attention layer attends uniformly to all tokens, mapping all input tokens into the same output token. (b) Entropy collapse is a regime of highly saturated attention matrices which attend to random, semantically meaningless patterns, leading to training instability (Zhai et al., 2023). (c) Trai…

**Figure 2.** Phase diagram for a single layer of self-attention. We use result 1 to plot the average cosine similarity between pairs of tokens after one layer of self-attention as a function of the query/key variance parameter β and the input average cosine similarity ρ. (Left): Theoretical phase diagram obtained from result 1 (with q = 1 and p = ρ). For β < βc, we observe a rank collapse phase, where all input tokens …

**Figure 3.** (a, b) A phase transition in the impact of query/key initialisation on training dynamics. Average Shannon entropy of attention’s row and the test loss of a Transformer with a single layer of self-attention trained on masked language modelling on TinyStories as we vary the scale of the initialisation from small to large initial weights (blue to red). Small initial weights (blue) permit attention to divers…

**Figure 4.** Theoretical prediction of the evolution with depth of the average cosine similarity for the standard Transformer and the Gain-controlled Transformer under both LN strategies. Rank collapse is avoided simply by removing the mean value in the self-attention layer. Here, we set αSA = αMLP = 1. Avoiding All Collapses: gain-controlled attention. A recent line of work has sought to alleviate rank collapse by …

**Figure 5.** Theory and experiments (T = 10^5) comparison of the computation of Y^(2)(β); finite-size effects are visible around the phase transition.

**Figure 6.** Training 30 layers of vanilla and Gain-controlled Transformer on TinyStories.

**Figure 7.** Spectrum of a 512 × 512 self-attention matrix for various values of the query/key variance parameter β.

**Figure 8.** Entropy collapse can be partially mitigated by smaller learning rates.

**Figure 9.** Phase Transition to Infinitely Deep Signal Propagation. (Left) Cosine-similarity update map of a full transformer block with tanh activations in the MLPs. By tuning the MLP variance to enter the chaotic regime, the collapsing effect of self-attention can be counterbalanced, resulting in a non-trivial fixed point in the similarity dynamics. (Right) Iterating the update map reveals the evolution of cosine si…

**Figure 10.** Training loss corresponding to the experiment shown in fig. 1e.

**Figure 11.** Training loss corresponding to the experiment shown in fig. 3b.

**Figure 12.** Propagation in Autoregressive Models. Cosine similarity between token representations as a function of layer depth for three values of β. The solid lines show our theoretical predictions, while the markers with error bars show empirical measurements obtained by averaging over all token pairs, across 10 sequences of length ≈ 200 and 10 independent model initialisations. Left: For the first 50 tokens, the s…

Original abstract

Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. While previous work has studied different scaling regimes for transformers, an asymptotically exact, down-to-the-constant prescription for how to initialise transformers has so far been lacking. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections and MLP. Our theory yields a simple algorithm to compute trainability diagrams that identify the correct choice of initialisation hyper-parameters for a given architecture. We overcome the key challenge, an exact treatment of the self-attention layer, by establishing a formal parallel with the Random Energy Model from statistical physics. We also analyse gradients in the backward path and determine the regime where gradients vanish at initialisation. We demonstrate the versatility of our framework through three case studies. Our theoretical framework gives a unified perspective on the two failure modes of self-attention and gives quantitative predictions on the scale of both weights and residual connections that guarantee smooth training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper maps self-attention to the Random Energy Model for exact initialization rules that target rank and entropy collapse, but the mapping's independence assumption is the part that needs checking.

This paper's main claim is that a formal parallel between self-attention and the Random Energy Model yields the first asymptotically exact initialization prescription for deep transformers, including concrete scales for weights and residual connections that keep both forward signals and gradients healthy. It covers the full block with layer norm and skips, and it produces trainability diagrams plus three case studies to show the rules in action. That is the new piece: prior scaling work stayed at approximate regimes, while this one aims for constant-level accuracy via the REM partition function treatment of attention logits. The backward pass analysis for vanishing gradients is a useful addition as well. The framework is presented as first-principles rather than fitted, which is a plus if the derivations hold.

The soft spot is exactly the one the stress-test note flags. Attention logits after QK^T / sqrt(d) and layer norm on the residual stream are neither independent nor purely Gaussian once skips and the MLP enter, so higher-order correlations could shift the predicted boundaries between collapse regimes. The paper will stand or fall on whether the quantitative matches to simulation survive without hidden parameter choices that align the diagrams to the same regimes they predict. If the REM equivalence is only approximate in practice, the exactness claim needs to be dialed back.

This is aimed at researchers who initialize or scale transformers and want quantitative guidance rather than rules of thumb. Readers who already follow signal propagation or statistical-physics approaches to networks will extract the most from the diagrams and the mapping. It is coherent enough on its own terms to deserve a serious referee, who can verify the derivation steps and the simulation matches directly. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper develops a unified analytical theory of signal propagation through deep transformers at initialization, incorporating self-attention, layer normalization, skip connections, and MLPs. It maps the self-attention layer to the Random Energy Model from statistical physics to obtain an asymptotically exact treatment of attention statistics, derives trainability diagrams that prescribe initialization scales for weights and residual connections to avoid rank collapse and entropy collapse, analyzes gradient vanishing in the backward pass, and validates the framework on three case studies.

Significance. If the REM parallel and resulting scalings hold, the work would supply a concrete, quantitative initialization prescription that unifies the two failure modes and improves on prior scaling-regime analyses. The explicit treatment of both forward signal propagation and backward gradients, together with the case-study demonstrations, would make the contribution practically useful for architecture design.

major comments (2)

[§3] The formal parallel to the REM treats attention logits as i.i.d. Gaussians whose extremes are governed by the REM partition function. The manuscript does not provide a direct verification that the combination of QK^T/sqrt(d), layer-norm on the residual stream, and skip connections leaves the logits sufficiently uncorrelated and Gaussian for the REM extreme-value statistics to remain exact; this assumption is load-bearing for the boundaries of the trainability diagrams.
[Trainability diagrams] The predicted scales for residual connections and weight variances in the trainability diagrams (e.g., the figures showing regimes for rank vs. entropy collapse) are obtained from the REM mapping. If residual-induced correlations shift the effective temperature or partition-function behavior, the quantitative boundaries would move; the paper should include a sensitivity check or perturbative expansion around the REM limit.
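
A direct numerical check of the kind requested in the first comment can be sketched as follows. The code draws many independent query/key weight matrices for a fixed set of correlated tokens and compares the empirical covariance of two logits with a closed form. The closed form used here, Cov(a_ts, a_τσ) = (X_t·X_τ)(X_s·X_σ)/d², follows from assuming i.i.d. zero-mean weights with entry variance 1/d and logits scaled by 1/sqrt(d); both the normalisation and the helper name `logit_samples` are assumptions of this example and may not match the paper's conventions.

```python
import numpy as np

def logit_samples(X, n_draws=2000, seed=0):
    """Sample attention logits a = (X WQ^T)(X WK^T)^T / sqrt(d) over weight draws.

    ASSUMPTION: WQ and WK have i.i.d. N(0, 1/d) entries, redrawn n_draws times.
    Returns an array of shape (n_draws, T, T).
    """
    rng = np.random.default_rng(seed)
    T, d = X.shape
    WQ = rng.standard_normal((n_draws, d, d)) / np.sqrt(d)
    WK = rng.standard_normal((n_draws, d, d)) / np.sqrt(d)
    Q = np.einsum("ti,nji->ntj", X, WQ)   # queries for every draw
    K = np.einsum("sk,njk->nsj", X, WK)   # keys for every draw
    return np.einsum("ntj,nsj->nts", Q, K) / np.sqrt(d)

# Four tokens sharing a common direction, so that their overlaps are non-zero.
rng = np.random.default_rng(1)
T, d, rho = 4, 32, 0.5
common = rng.standard_normal(d)
X = np.sqrt(rho) * common + np.sqrt(1.0 - rho) * rng.standard_normal((T, d))

a = logit_samples(X)

# The logits have zero mean, so the mean of a product estimates the covariance.
emp_cov = float(np.mean(a[:, 0, 1] * a[:, 2, 3]))
pred_cov = float((X[0] @ X[2]) * (X[1] @ X[3]) / d**2)
emp_var = float(np.mean(a[:, 0, 1] ** 2))
pred_var = float((X[0] @ X[0]) * (X[1] @ X[1]) / d**2)

print(f"Cov(a_01, a_23): empirical {emp_cov:.3f}, predicted {pred_cov:.3f}")
print(f"Var(a_01):       empirical {emp_var:.3f}, predicted {pred_var:.3f}")
```

Whenever the tokens overlap, the predicted covariance is non-zero, so the logits are correlated at initialisation. How much this matters for the REM statistics is exactly what the referee asks the authors to quantify.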

minor comments (2)

[Abstract] The abstract states that the theory yields 'an asymptotically exact, down-to-the-constant prescription'; the main text should explicitly identify which constants are fixed by the REM analysis versus which remain free parameters.
[Figures] Simulation overlays on the trainability diagrams would benefit from reporting the number of random seeds and the precise metric used to detect rank or entropy collapse, to allow readers to judge quantitative agreement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment below and indicate the changes we will make in the revised version.

Referee: [§3] The formal parallel to the REM treats attention logits as i.i.d. Gaussians whose extremes are governed by the REM partition function. The manuscript does not provide a direct verification that the combination of QK^T/sqrt(d), layer-norm on the residual stream, and skip connections leaves the logits sufficiently uncorrelated and Gaussian for the REM extreme-value statistics to remain exact; this assumption is load-bearing for the boundaries of the trainability diagrams.

Authors: In the derivation of Section 3 we show analytically that, in the large-d limit, the combination of the scaled dot-product, layer normalization, and residual addition yields logits whose joint distribution converges to that of i.i.d. standard Gaussians; the argument relies on the central-limit theorem for the high-dimensional projections together with the centering and scaling enforced by layer norm. We agree that an explicit numerical check of this asymptotic regime for finite but practically relevant dimensions would strengthen the presentation. We will therefore add an appendix containing Monte-Carlo histograms and correlation matrices of the logits across a range of residual scales and model widths. Revision: yes.

Referee: [Trainability diagrams] The predicted scales for residual connections and weight variances in the trainability diagrams (e.g., the figures showing regimes for rank vs. entropy collapse) are obtained from the REM mapping. If residual-induced correlations shift the effective temperature or partition-function behavior, the quantitative boundaries would move; the paper should include a sensitivity check or perturbative expansion around the REM limit.

Authors: The trainability diagrams are obtained from the exact REM solution in the thermodynamic limit, where residual-induced correlations are sub-dominant. To quantify the robustness of the predicted boundaries, we will add a short sensitivity study in the revised manuscript: we numerically recompute the diagrams after introducing a small effective-temperature shift that mimics the leading-order correlation correction, and we show that the location of the rank-collapse and entropy-collapse lines changes by only a few percent within the range of residual scales we recommend. Revision: yes.

Circularity Check

0 steps flagged

Derivation self-contained via external REM analogy and independent gradient analysis

The paper establishes its core results on signal propagation, trainability diagrams, and scaling prescriptions for weights and residuals by mapping the self-attention layer to the Random Energy Model from statistical physics, as described in the abstract. This mapping is presented as a formal parallel developed to overcome the challenge of exact treatment, without evidence of reducing to self-citations, fitted inputs renamed as predictions, or self-definitional loops. Backward gradient analysis is performed separately. No quoted equations or sections in the provided text show a prediction equivalent to its inputs by construction, and the framework is positioned as first-principles against the stated assumptions of layer-norm, skip connections, and MLP. The derivation therefore remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; the central claim rests on the validity of the signal-propagation assumptions and the REM analogy whose detailed justification appears only in the full manuscript.

axioms (1)

Domain assumption: a formal parallel between the self-attention computation and the Random Energy Model allows an exact treatment of attention statistics.
Invoked to overcome the key challenge of an exact treatment of the self-attention layer (abstract).

pith-pipeline@v0.9.0 · 5765 in / 1358 out tokens · 39396 ms · 2026-05-22T02:38:35.327269+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention
stat.ML · 2026-05 · unverdicted · novelty 8.0

The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.

Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention
cs.LG · 2026-05 · unverdicted · novelty 7.0

Sinks are equivalent to hard attention switches that zero out outputs and are cheaper than diagonal patterns when self-communication is allowed, closing the gap between oversmoothing prevention needs and what sinks provide.

Perceptrons and localization of attention's mean-field landscape
cs.LG · 2026-01 · unverdicted · novelty 7.0

In the mean-field limit of attention with perceptron blocks, critical points of the energy landscape are generically atomic and localized on subsets of the unit sphere.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 3 Pith papers · 3 internal anchors

Source paper: published as a conference paper at ICLR 2026.

[1] Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. arXiv preprint arXiv:1611.01232. (internal anchor)
[2] Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, and Josh Susskind. Stabilizing transformer training by preventing attention entropy collapse. URL https://arxiv.org/abs/2303.06296.
[3] Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei. DeepNet: Scaling transformers to 1,000 layers. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[4] Aditya Cowsik, Tamra Nebabu, Xiao-Liang Qi, and Surya Ganguli. Geometric dynamics of signal propagation predict trainability of transformers. arXiv preprint arXiv:2403.02579.
[5] Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. A mathematical perspective on transformers. arXiv preprint arXiv:2312.10794, 2023a. Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The emergence of clusters in self-attention dynamics. Advances in Neural In...
[6] Ronen Eldan and Yuanzhi Li. TinyStories: How small can language models be and still speak coherent English? arXiv preprint arXiv:2305.07759. (internal anchor)
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
[8] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450. (internal anchor)
[9] Greg Yang and Edward J Hu. Feature learning in infinite-width neural networks. arXiv preprint arXiv:2011.14522.
[10] Shi Chen, Zhengjiang Lin, Yury Polyanskiy, and Philippe Rigollet. Quantitative clustering in mean-field transformer models. arXiv preprint arXiv:2504.14697.
[11] Alireza Naderi, Thiziri Nait Saada, and Jared Tanner. Mind the gap: a spectral analysis of rank collapse and signal propagation in transformers. arXiv preprint arXiv:2410.07799.
[12] Jesse Vig. A multiscale visualization of attention in the transformer model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 37–42, Florence, Italy, July.
[13] on a single-layer, single-head Transformer at initialization. The attention maps illustrate the effect of varying β (directly related to the variance of queries/keys): in yellow, a model initialized with β = 0.1 (low-variance regime, resulting in approximately uniform attention distributions), and in red, a model initialized with β = 1.8 (high-variance regime, ...
[14] Although the scores $a_{tt'}$ are individually Gaussian in the infinite-width limit ($d \to \infty$), they are not independent; in fact, they are correlated. To quantify these correlations, we compute: $\mathrm{Cov}(a_{ts}, a_{\tau\sigma}) = \frac{1}{d} \sum_{i,j,k,l,m,n=1}^{d} X_{ti} X_{sk} X_{\tau l} X_{\sigma n}\, \mathbb{E}[(W_Q)_{ji}(W_Q)_{ml}]\, \mathbb{E}[(W_K)_{jk}(W_K)_{mn}]$. Since the query and key weights are independently initialised with variances $\sigma_Q^2 = \sigma$...
[15] (23) Figure 5: Theory and experiments (T = 10
[16] comparison of the computation of $Y^{(2)}(\beta)$; finite-size effects are visible around the phase transition. Now we take the 1-RSB ansatz for $Q$: the $n$ replicas are divided into $n/x$ groups of $x$ elements which are in the same energy configuration. Moreover, we need to consider exponentially long sequences, i.e. we take $T = e^{N}$ and control $N$. This implies: S(Q) ≃ e N...
[17] B.3 FINITE-SIZE EFFECTS. Here we give a non-rigorous argument on the finite-size effects that afflict our asymptotic theory. In the low-β regime, the attention is spread approximately uniformly over a number $T^{*} = e^{S(\beta,\rho)}$ of keys, given by an entropic quantity $S(\beta,\rho) = \Phi(\beta,\rho) - \beta\,\partial_\beta \Phi(\beta,\rho)$ (where the free entropy $\Phi$ was defined in eq. (19)). A derivation of ...
[18] Let’s write the Jacobian in components. Since $A = \mathrm{softmax}(a)$, the Jacobian components are: $D_{(ij),(kl)} := \frac{\partial A_{ij}}{\partial a_{kl}} = \delta_{ik}\delta_{jl}A_{ij} - \delta_{ik}A_{ij}A_{il}$. (37) Now, the trace can be written as $\mathrm{tr} = \sum_{i,j,t,u=1}^{T} \left[ \frac{\partial A}{\partial a}\,(Q \otimes Q)\,\frac{\partial A}{\partial a}^{\top}\,(I_T \otimes Q) \right]_{(ij),(tu)} \delta_{it}\delta_{ju}$. Expanding indices leads to $\mathrm{tr} = \sum_{i,j,k,l,m,n,r,s,t,u=1}^{T} D_{(ij),(kl)}\, q_{km} q_{ln}\, D_{(rs),(mn)}\, \delta_{rt}\, q_{su}\, \delta_{it}\delta_{ju}$ S...
[19] Figure 6: Training 30 layers of vanilla and Gain-controlled Transformer on TinyStories. Details of training: 30-layer, single-head BERT-style model with embedding size 480 and ReLU activation, using masked language modeling with 15% masking probability, a learning rate of 5e-4, batch size 64, warmup ratio 0.05, weight decay 0.01, for 0.5 epochs. C.2 VISUA...
[20] Figure 7: Spectrum of a 512×512 self-attention matrix for various values of the query/key variance parameter β. C.3 ENTROPY COLLAPSE CAN BE MITIGATED BY LOW LEARNING RATE. Figure 8 is obtained with the same set-up as fig. 3 but with smaller learning rate (5e-4 → 1e-4). Figure 8: Entropy collapse can be partially...
