
arxiv: 2605.20749 · v1 · pith:DI2TYUIYnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?

Pith reviewed 2026-05-21 06:43 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords gated linear units · neural tangent kernel · condition number · training dynamics · convergence rate · generalization · large language models

The pith

GLU structures reshape the NTK spectrum to lower its condition number and speed up convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper seeks to explain the consistent advantage of gated linear units (GLU) over their non-gated counterparts in modern language models. It does so by examining two-layer networks in the neural tangent kernel (NTK) regime, where the gating mechanism alters the spectrum of the kernel. The result is a smaller condition number and a more compact eigenvalue distribution, which speed up training. The work also finds that this change does not substantially narrow the generalization gap, indicating that the benefit is mostly about reaching low loss faster rather than achieving better final performance.

Core claim

The GLU structure reshapes the NTK spectrum, leading to a smaller condition number and a more compact eigenvalue distribution. This spectral property improves training dynamics, causing GLU models to converge faster than non-GLU models and producing a characteristic loss-crossing phenomenon. On models such as ViT and GPT-2, GLU shows limited effect on reducing the generalization gap, so its main value is accelerating optimization.

What carries the argument

The neural tangent kernel (NTK) spectrum of the model, whose condition number and eigenvalue distribution the GLU modifies to enable quicker gradient descent progress.
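
As a rough illustration of what "the NTK condition number" means here, the following minimal sketch builds the empirical NTK Gram matrix of a finite-width two-layer ReLU network and of a gated variant, then compares their condition numbers. The gated form `z(x) = sum_k V_k * relu(W_k . x) * (U_k . x)`, the standard-normal initialization, and the sizes `n`, `d`, `m` are all assumptions made for this example, not the paper's exact setup.

```python
import numpy as np

def relu_ntk(X, W, V):
    """Empirical NTK of z(x) = sum_k V_k * relu(W_k . x)."""
    H = X @ W.T                      # (n, m) pre-activations
    A = np.maximum(H, 0.0)           # dz/dV_k
    G = (H > 0) * V                  # V_k * relu'(W_k . x), coefficient of x in dz/dW_k
    return A @ A.T + (X @ X.T) * (G @ G.T)

def reglu_ntk(X, W, U, V):
    """Empirical NTK of the assumed gated form z(x) = sum_k V_k * relu(W_k . x) * (U_k . x)."""
    H, L = X @ W.T, X @ U.T
    A = np.maximum(H, 0.0)
    B = A * L                        # dz/dV_k
    Gw = (H > 0) * L * V             # coefficient of x in dz/dW_k
    Gu = A * V                       # coefficient of x in dz/dU_k
    return B @ B.T + (X @ X.T) * (Gw @ Gw.T + Gu @ Gu.T)

def condition_number(K):
    eig = np.linalg.eigvalsh(K)      # eigenvalues in ascending order
    return eig[-1] / eig[0]

rng = np.random.default_rng(0)
n, d, m = 64, 32, 2048               # hypothetical sample count, input dim, width
X = rng.standard_normal((n, d))
W = rng.standard_normal((m, d))
U = rng.standard_normal((m, d))
V = rng.standard_normal(m)

K_relu = relu_ntk(X, W, V)
K_reglu = reglu_ntk(X, W, U, V)
cond_relu = condition_number(K_relu)
cond_reglu = condition_number(K_reglu)
print(f"ReLU  NTK condition number: {cond_relu:.1f}")
print(f"ReGLU NTK condition number: {cond_reglu:.1f}")
```

The printed values depend on the assumed initialization and sizes, so they should be read as an illustration of the quantity being compared rather than a reproduction of the paper's numbers.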

Load-bearing premise

The spectral effects observed in two-layer NTK analysis carry over to explain GLU advantages in deep nonlinear networks used in practice.

What would settle it

A direct computation showing that the NTK condition number remains unchanged or increases with GLU in a deep network would challenge the link between the two-layer analysis and real LLM performance.

Figures

Figures reproduced from arXiv: 2605.20749 by Peisong Wen, Qianqian Xu, Qingming Huang, Xingyu Lyu, Zhiyong Yang.

Figure 1
Figure 1: Illustration of our theoretical discovery. We find that (A) by adding the gating structure, (B) the NTK matrix of GLU structure becomes better conditioned, which (C) explains the faster optimization of GLU-based models.
Surrounding text: … convergence of training process in the Neural Tangent Kernel (NTK) regime. While directly analyzing neural network optimization is inherently challenging, the NTK framework provides theor…
Figure 2
Figure 2: Comparison between theoretical and numerical NTK condition numbers for ReLU and ReGLU models. The theoretical predictions closely track the numerically computed condition numbers.
[Plot residue from adjacent bar charts, panels (a) and (b): x-axis FFN Structure (ReLU, GELU, SiLU), y-axis Condition Number, legend Non-GLU vs GLU.]
Figure 3
Figure 3: Condition number of (a) ViT and (b) GPT-2 under different activation choices.
Surrounding text: … Gram matrix XX⊤/d. As shown in Fig. 4, this gating mechanism significantly enhances the diagonal dominance of the kernel: diagonal entries become more pronounced, while off-diagonal entries are suppressed. In fact, by definition, the off-diagonal entries of the NTK matrix denote the model gradient correlation between different s…
Figure 4
Figure 4: Visualization of the NTK matrices for ReLU (left) and ReGLU (right) models.
Surrounding text: 4. Training Dynamics in the Kernel Regime: From NTK Spectrum to Loss Crossing. In the previous section, we showed that GLU induces a more diagonally dominant NTK, leading to a smaller maximal eigenvalue and a larger minimal eigenvalue, and hence a more contracted spectrum. In this section, we show how this spectral reshaping leads to di…
Figure 5
Figure 5: Training trajectories in a two-sample toy model for ReLU and ReGLU. (a) 30 steps. (b) 70 steps. (c) 150 steps. (d) 1000 steps.
Surrounding text: 4.2. Loss-Crossing Phenomenon. We now formalize the above intuition and connect it to the loss-crossing phenomenon. Our analysis is in the infinite-width limit (m → ∞), where the model operates in the kernel regime, and the expected training loss admits a closed-form expression. Pro…
Figure 6
Figure 6: Training loss curves on two-layer MLP models. (a) Gaussian data with learning rate 0.005. (b) MNIST with learning rate 1 × 10⁻⁵. (c) Gaussian data with learning rate 0.008. (d) MNIST with learning rate 5 × 10⁻⁵.
Surrounding text: 5. Generalization Gap Analysis. Finally, we examine whether GLU variants help reduce the generalization gap. Intuitively, the multiplicative gating in GLU introduces second-order nonlinearity, whi…
Figure 8
Figure 8: Generalization gap vs training loss with MLP Mixer trained on Tiny ImageNet. (a) ReLU vs ReGLU. (b) GELU vs GEGLU. (c) SiLU vs SwiGLU.
[Plot residue from adjacent panels titled "ViT on Tiny ImageNet": x-axis Ltrain, y-axis built from Ltest and Ltrain; (a) ReGLU vs ReLU, p-value 0.1770; (b) GEGLU vs GELU, p-value 0.2080; remaining panel truncated.]
Figure 9
Figure 9: Generalization gap vs training loss with ViT trained on Tiny ImageNet. (a) ReLU vs ReGLU. (b) GELU vs GEGLU. (c) SiLU vs SwiGLU.
Figure 10
Figure 10: Generalization gap vs training loss with MLP Mixer trained on CIFAR-10. (a) ReLU vs ReGLU. (b) GELU vs GEGLU. (c) SiLU vs SwiGLU.
[Plot residue from adjacent panels titled "ViT on CIFAR-10": x-axis Ltrain, y-axis built from Ltest and Ltrain; (a) ReGLU vs ReLU, p-value 0.1310; (b) GEGLU vs GELU, p-value 0.1270; remaining panel truncated.]
Figure 11
Figure 11: Generalization gap vs training loss with ViT trained on CIFAR-10. (a) ReLU vs ReGLU. (b) GELU vs GEGLU. (c) SiLU vs SwiGLU.
[Plot residue from adjacent panels titled "GPT-2 on FineWeb-Edu": x-axis Ltrain, y-axis built from Lval and Ltrain; (a) GELU vs GEGLU, p-value 0.0660; (b) SiLU vs SwiGLU, p-value 0.4430.]
Figure 12
Figure 12: Generalization gap vs training loss with GPT-2 trained on FineWeb-Edu. (a) GELU vs GEGLU. (b) SiLU vs SwiGLU.
read the original abstract

Gated Linear Units (GLU) and their variants are widely adopted in modern open-source large language model architectures and consistently outperform their non-gated counterparts, yet the underlying reasons for this advantage remain unclear. In this work, we study GLU by analyzing two-layer networks in the neural tangent kernel (NTK) regime. Our analysis reveals that the GLU structure reshapes the NTK spectrum, leading to a smaller condition number and a more compact eigenvalue distribution. Building on this finding, we further analyze the resulting training dynamics and show how the reshaped spectrum leads to faster convergence of GLU models, including a characteristic loss-crossing phenomenon observed between GLU and non-GLU models. Finally, we empirically observe that GLU has limited impact in reducing the generalization gap on various models, including ViT and GPT-2, suggesting that its primary benefit lies in accelerating optimization rather than reducing the generalization gap.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes Gated Linear Units (GLU) in two-layer networks under the neural tangent kernel (NTK) regime, claiming that GLU reshapes the NTK spectrum to yield a smaller condition number and more compact eigenvalue distribution. This spectral change is then linked to faster convergence and a loss-crossing phenomenon in training dynamics. Empirical results on ViT and GPT-2 are presented to show that GLU has limited impact on reducing the generalization gap, suggesting its primary benefit is optimization speed rather than generalization.

Significance. If the NTK spectral reshaping and resulting condition-number reduction can be shown to persist or dominate in deeper, non-linear architectures with residuals, the work would offer a concrete theoretical account for the empirical superiority of GLU variants in modern LLMs, shifting focus from generalization to optimization dynamics. The reported generalization-gap experiments on ViT and GPT-2 provide useful negative evidence that strengthens the optimization-focused interpretation.

major comments (2)
  1. [Abstract and NTK analysis sections] The central derivation of spectrum reshaping, smaller condition number, and faster convergence is performed only for two-layer networks in the NTK regime (abstract and the analysis sections). The manuscript does not derive, bound, or empirically compute the NTK spectrum (or effective condition number) for any architecture deeper than two layers, leaving the extrapolation to deep LLMs (including the GPT-2 experiments) unverified and load-bearing for the stated explanation of GLU's advantage.
  2. [Training dynamics analysis] The training-dynamics claims, including the loss-crossing phenomenon, rest on the two-layer NTK spectrum result. Without additional analysis showing that the same spectral properties control convergence once depth, layer-wise non-linearities, and residual connections are introduced, the link between the derived condition-number improvement and observed behavior in modern models remains an assumption rather than a demonstrated mechanism.
minor comments (2)
  1. [Empirical section] Dataset details, error bars, and hyper-parameter choices for the ViT and GPT-2 generalization-gap experiments should be expanded to allow readers to assess whether post-hoc fitting affects the reported limited impact on generalization.
  2. [NTK derivation] Notation for the NTK eigenvalues and condition number should be made fully explicit (including any dependence on the gating parameters) to clarify that the reported improvement is not an artifact of parameter choices.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback and for highlighting the importance of verifying the NTK spectral properties in deeper architectures. We agree that our theoretical results are derived for two-layer networks and that extending them rigorously to deep models with residuals is a valuable direction for future work. Below, we address the major comments point by point.

read point-by-point responses
  1. Referee: [Abstract and NTK analysis sections] The central derivation of spectrum reshaping, smaller condition number, and faster convergence is performed only for two-layer networks in the NTK regime (abstract and the analysis sections). The manuscript does not derive, bound, or empirically compute the NTK spectrum (or effective condition number) for any architecture deeper than two layers, leaving the extrapolation to deep LLMs (including the GPT-2 experiments) unverified and load-bearing for the stated explanation of GLU's advantage.

    Authors: We acknowledge this limitation. Our analysis focuses on two-layer networks in the NTK regime to obtain closed-form insights into how GLU reshapes the spectrum and reduces the condition number. Extending this derivation to deeper networks is technically challenging due to the complexity of the NTK in the presence of residuals and multiple layers, and we do not claim to have done so. Instead, the two-layer case serves as a theoretical foundation, and we support the extrapolation with empirical evidence from training ViT and GPT-2 models, where GLU accelerates optimization without major generalization improvements. We will revise the manuscript to more explicitly state this scope and add a paragraph discussing the assumptions involved in applying the insights to modern LLMs. revision: partial

  2. Referee: [Training dynamics analysis] The training-dynamics claims, including the loss-crossing phenomenon, rest on the two-layer NTK spectrum result. Without additional analysis showing that the same spectral properties control convergence once depth, layer-wise non-linearities, and residual connections are introduced, the link between the derived condition-number improvement and observed behavior in modern models remains an assumption rather than a demonstrated mechanism.

    Authors: The loss-crossing is an empirical observation in our experiments on deeper models. The NTK analysis in two layers explains a plausible mechanism via improved conditioning leading to faster convergence. While we agree that demonstrating the same spectral control in deep residual networks would provide stronger evidence, our current contribution is to identify this mechanism in a tractable setting and show consistency with practice. We will update the discussion to clarify that the link is based on the simplified model and empirical corroboration, rather than a complete proof for all architectures. revision: partial

standing simulated objections not resolved
  • Deriving or bounding the NTK spectrum for deep residual networks with GLU and non-linear activations.

Circularity Check

0 steps flagged

NTK spectrum analysis for two-layer networks is derived independently without reducing to fitted inputs or self-citations

full rationale

The paper's core derivation analyzes the NTK for two-layer networks under the GLU activation to obtain the eigenvalue spectrum, condition number, and resulting convergence rates directly from the kernel expressions and activation properties. These steps are mathematical derivations rather than fits to the target advantage or self-referential definitions. No load-bearing self-citation chains, ansatzes smuggled via prior work, or renaming of known results appear in the described chain; the generalization-gap experiments are presented as separate empirical observations. The analysis is self-contained within its stated two-layer NTK regime and does not reduce the claimed spectrum reshaping to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the NTK approximation for two-layer networks and the assumption that spectrum properties directly govern convergence speed in the regimes studied. No free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption NTK regime holds for the two-layer networks under consideration
    Invoked to justify the spectrum analysis that produces the condition-number claim.

pith-pipeline@v0.9.0 · 5703 in / 1343 out tokens · 29798 ms · 2026-05-21T06:43:31.331333+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

  1. [1]

    The Llama 3 Herd of Models

    PMLR, 2017. de Ryck, T., Bonnet, F., Mishra, S., and de Bézenac, E. An operator preconditioning perspective on training in physics-informed machine learning. In International Conference on Learning Representations, 2024. Dey, R. and Salem, F. M. Gate-variants of gated recurrent unit (GRU) neural networks. In 2017 IEEE 60th international midwest symp...

  2. [2]

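    For the ReLU model, the NTK matrix, whose $ij$-th element is $K_{ij}$, is approximately $K = \alpha XX^\top + \beta rr^\top + \gamma D$ (10). Here $\alpha = \frac{1}{4} + \frac{m}{4d}$, $\beta = \frac{m}{2\pi d}$, $\gamma = \frac{1}{4} + \frac{m}{4d} - \frac{m}{2\pi d}$.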

  3. [3]

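    For the ReGLU model, the NTK matrix, whose $ij$-th element is $\tilde{K}_{ij}$, is approximately $\tilde{K} = \tilde{\alpha}\,(XX^\top) \odot (XX^\top) + \tilde{\beta}\,(rr^\top) \odot (XX^\top) + \tilde{\gamma} D^2$ (11). Here $\tilde{\alpha} = \frac{m}{4d^2} + \frac{1}{2d}$, $\tilde{\beta} = \frac{1}{2\pi d} + \frac{m}{2\pi d^2}$, $\tilde{\gamma} = \frac{1}{2d} - \frac{1}{2\pi d} + \frac{m}{4d^2} - \frac{m}{2\pi d^2}$. Proof. The proof is carried out in two main steps. First, we derive a general expression for the NTK matrix that is independent of the specific activation functi...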
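
The two closed-form approximations, Eq. (10) and Eq. (11), can be assembled directly from a data matrix. The sketch below does so under two assumptions that the excerpt does not spell out: `r` is taken to be the vector of row norms `||x_i||` and `D` the diagonal matrix of squared norms `||x_i||^2`, which is the reading that reproduces the diagonal entries quoted in Eq. (14) and Eq. (15). The Gaussian data and the sizes are hypothetical.

```python
import numpy as np

def relu_ntk_approx(X, m):
    """Eq. (10): K = alpha * XX^T + beta * rr^T + gamma * D."""
    n, d = X.shape
    alpha = 0.25 + m / (4 * d)
    beta = m / (2 * np.pi * d)
    gamma = 0.25 + m / (4 * d) - m / (2 * np.pi * d)
    G = X @ X.T
    r = np.linalg.norm(X, axis=1)            # assumed: r_i = ||x_i||
    return alpha * G + beta * np.outer(r, r) + gamma * np.diag(r**2)

def reglu_ntk_approx(X, m):
    """Eq. (11): K~ = a * (XX^T o XX^T) + b * (rr^T o XX^T) + g * D^2."""
    n, d = X.shape
    a = m / (4 * d**2) + 1 / (2 * d)
    b = 1 / (2 * np.pi * d) + m / (2 * np.pi * d**2)
    g = 1 / (2 * d) - 1 / (2 * np.pi * d) + m / (4 * d**2) - m / (2 * np.pi * d**2)
    G = X @ X.T
    r = np.linalg.norm(X, axis=1)
    return a * G * G + b * np.outer(r, r) * G + g * np.diag(r**4)

def condition_number(K):
    eig = np.linalg.eigvalsh(K)              # ascending order
    return eig[-1] / eig[0]

rng = np.random.default_rng(0)
n, d, m = 200, 50, 1000                      # hypothetical sizes
X = rng.standard_normal((n, d))
r = np.linalg.norm(X, axis=1)

K_relu = relu_ntk_approx(X, m)
K_reglu = reglu_ntk_approx(X, m)
print("cond(K)  =", condition_number(K_relu))
print("cond(K~) =", condition_number(K_reglu))
```

Because the three coefficients in each formula sum to the diagonal prefactor of the entrywise expressions, the diagonals of both matrices can be checked against Eq. (14) and Eq. (15) without any approximation.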

  4. [4]

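    Consider the ReLU model first: $z(x) = V\phi(Wx)$

    Obtaining the general form of the NTK matrix. Consider the ReLU model first: $z(x) = V\phi(Wx)$. Taking derivatives w.r.t. all parameters, we get $\frac{\partial z(x)}{\partial V_k} = \phi(W_k^\top x)$ and $\frac{\partial z(x)}{\partial W_{ks}} = V_k\,\phi'(W_k^\top x)\,x_s$, where $W_k$ stands for the $k$-th row of $W$. Hence we have $\langle \nabla_V z(x_i), \nabla_V z(x_j) \rangle = \sum_{k=1}^{m} \phi(W_k^\top x_i)\,\phi(W_k^\top x_j)$ and $\langle \nabla_W z(x_i), \nabla_W z(x_j) \rangle = (x_i^\top x_j) \sum_{k=1}^{m} V_k^2\,\phi'(W_k^\top x_i)\,\phi'(W_k^\top x_j)$. Takin...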
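
The derivative formulas above can be sanity-checked numerically. This small sketch, with hypothetical sizes and random parameters, compares the analytic gradient of `z(x) = V phi(Wx)` with a central finite difference and confirms that the inner product of full gradients equals the sum of the two Gram expressions.

```python
import numpy as np

def z(x, W, V):
    """Two-layer ReLU model z(x) = V . relu(W x)."""
    return V @ np.maximum(W @ x, 0.0)

def grads(x, W, V):
    """Analytic gradients: dz/dV_k = relu(W_k . x), dz/dW_ks = V_k * relu'(W_k . x) * x_s."""
    h = W @ x
    dV = np.maximum(h, 0.0)
    dW = (V * (h > 0))[:, None] * x[None, :]
    return dV, dW

rng = np.random.default_rng(1)
d, m = 5, 7                                   # hypothetical input dim and width
W, V = rng.standard_normal((m, d)), rng.standard_normal(m)
xi, xj = rng.standard_normal(d), rng.standard_normal(d)

# Central finite difference for a single weight entry W[k, s].
eps, k, s = 1e-6, 2, 3
Wp, Wm = W.copy(), W.copy()
Wp[k, s] += eps
Wm[k, s] -= eps
fd = (z(xi, Wp, V) - z(xi, Wm, V)) / (2 * eps)

dV_i, dW_i = grads(xi, W, V)
dV_j, dW_j = grads(xj, W, V)

# NTK entry computed two ways: from raw gradients and from the closed-form sums.
ntk_from_grads = dV_i @ dV_j + np.sum(dW_i * dW_j)
hi, hj = W @ xi, W @ xj
ntk_from_formula = (np.maximum(hi, 0) @ np.maximum(hj, 0)
                    + (xi @ xj) * np.sum(V**2 * (hi > 0) * (hj > 0)))
print(fd, dW_i[k, s])
print(ntk_from_grads, ntk_from_formula)
```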

  5. [5]

    Since we are considering ReLU-activated models, we can use the arc-cosine kernel to get rid of the expectation factors

    Using the arc-cosine kernel and Taylor approximation. Since we are considering ReLU-activated models, we can use the arc-cosine kernel to get rid of the expectation factors. Specifically, we have (Cho & Saul, 2010) $\mathbb{E}_w[\phi(w^\top x_i)\,\phi(w^\top x_j)] = \frac{\sigma_w^2 \|x_i\| \|x_j\|}{2\pi}\left(\sqrt{1 - \rho_{ij}^2} + (\pi - \arccos\rho_{ij})\,\rho_{ij}\right)$ and $\mathbb{E}_w[\phi'(w^\top x_i)\,\phi'(w^\top x_j)] = \frac{1}{2\pi}\left(\pi - \arccos\rho_{ij}\right)$. Here $\rho_{ij} := \frac{x_i^\top x_j}{\|x_i\| \|x_j\|}$ is the ...
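
The two arc-cosine identities can be checked by Monte Carlo. The sketch below assumes `w` is drawn from an isotropic Gaussian with standard deviation `sigma_w`, which is the setting in which these closed forms are usually stated; the dimension, sample count, and inputs are hypothetical.

```python
import numpy as np

def cosine(xi, xj):
    rho = xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj))
    return np.clip(rho, -1.0, 1.0)

def arccos_kernel_relu(xi, xj, sigma_w=1.0):
    """Closed form of E_w[relu(w . xi) * relu(w . xj)] for w ~ N(0, sigma_w^2 I)."""
    rho = cosine(xi, xj)
    scale = sigma_w**2 * np.linalg.norm(xi) * np.linalg.norm(xj) / (2 * np.pi)
    return scale * (np.sqrt(1 - rho**2) + (np.pi - np.arccos(rho)) * rho)

def arccos_kernel_step(xi, xj):
    """Closed form of E_w[relu'(w . xi) * relu'(w . xj)]."""
    return (np.pi - np.arccos(cosine(xi, xj))) / (2 * np.pi)

rng = np.random.default_rng(0)
d, N, sigma_w = 4, 400_000, 1.0               # hypothetical dimension and sample count
xi, xj = rng.standard_normal(d), rng.standard_normal(d)

w = sigma_w * rng.standard_normal((N, d))     # N samples of the weight vector
hi, hj = w @ xi, w @ xj
mc_relu = np.mean(np.maximum(hi, 0) * np.maximum(hj, 0))
mc_step = np.mean((hi > 0) & (hj > 0))

print("relu product:", mc_relu, "vs closed form", arccos_kernel_relu(xi, xj, sigma_w))
print("step product:", mc_step, "vs closed form", arccos_kernel_step(xi, xj))
```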

  6. [6]

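    For the ReLU model, the $ij$-th element of its NTK matrix is approximately $K_{ij} = \begin{cases} \left(\frac{m}{2d} + \frac{1}{2}\right)\|x_i\|^2, & i = j; \\ \left[\frac{m}{2\pi d} + \left(\frac{1}{4} + \frac{m}{4d}\right)\rho_{ij} + \left(\frac{1}{2\pi} + \frac{m}{4\pi d}\right)\rho_{ij}^2\right]\|x_i\|\|x_j\|, & i \neq j. \end{cases}$ (14)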

  7. [7]

    Finally, for both models, we only keep the terms where $\rho_{ij}$ can be absorbed into $x_i^\top x_j$

    For the ReGLU model, the $ij$-th element of its NTK matrix is approximately $\tilde{K}_{ij} = \begin{cases} \left(\frac{m}{2d^2} + \frac{1}{d}\right)\|x_i\|^4, & i = j; \\ \left[\left(\frac{1}{2\pi d} + \frac{m}{2\pi d^2}\right)\rho_{ij} + \left(\frac{1}{2d} + \frac{m}{4d^2}\right)\rho_{ij}^2\right]\|x_i\|^2\|x_j\|^2, & i \neq j. \end{cases}$ (15) Finally, for both models, we only keep the terms where $\rho_{ij}$ can be absorbed into $x_i^\top x_j$. That is, we drop the $\rho_{ij}^2$ term in the ReLU model but keep it in the ReGLU model. Then we obtain the fi...

  8. [8]

    For the ReLU model, the largest eigenvalue of its NTK matrix is given by $\lambda_1(K) \approx \frac{m}{2\pi} \cdot n + \frac{d}{2} + \frac{(\pi - 1)m}{2\pi}$

  9. [9]

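    Therefore, $\lambda_1(K) = \Theta(mn)$, $\lambda_1(\tilde{K}) = \Theta(mn/d)$

    For the ReGLU model, the largest eigenvalue of its NTK matrix satisfies $\left(\frac{m}{4d} + \frac{1}{2}\right)n + \frac{m}{2} - \frac{m}{2\pi} + d - \frac{d}{2\pi} \lesssim \lambda_1(\tilde{K}) \lesssim \left(\frac{m}{4d} + \frac{m}{2\pi d} + \frac{1}{2} + \frac{1}{2\pi}\right)n + \frac{m}{2} + d$. Therefore, $\lambda_1(K) = \Theta(mn)$, $\lambda_1(\tilde{K}) = \Theta(mn/d)$. Proof. 1) ReLU model. For the ReLU model, we note that by the law of large numbers, the expression can be approximately written as $K \approx \alpha XX^\top + \beta d\,\mathbf{1}\mathbf{1}^\top + \gamma d\,I$. This form of e...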
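
A quick numerical comparison can make these largest-eigenvalue statements concrete. The sketch below assumes i.i.d. standard Gaussian inputs (so that `||x_i||^2` is close to `d`) and uses the law-of-large-numbers forms of both kernels; the sizes are hypothetical, and the ReGLU bounds are printed rather than asserted because they are stated only up to approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 50, 1000                       # hypothetical sizes
X = rng.standard_normal((n, d))               # assumption: i.i.d. N(0, 1) inputs
G = X @ X.T
I, ones = np.eye(n), np.ones((n, n))

# ReLU: K ~ alpha * XX^T + beta * d * 11^T + gamma * d * I
alpha = 0.25 + m / (4 * d)
beta = m / (2 * np.pi * d)
gamma = 0.25 + m / (4 * d) - m / (2 * np.pi * d)
K = alpha * G + beta * d * ones + gamma * d * I
lam1_relu = np.linalg.eigvalsh(K)[-1]
lam1_relu_pred = m / (2 * np.pi) * n + d / 2 + (np.pi - 1) * m / (2 * np.pi)

# ReGLU: K~ ~ a * (XX^T o XX^T) + b * d * XX^T + g * d^2 * I
a = m / (4 * d**2) + 1 / (2 * d)
b = 1 / (2 * np.pi * d) + m / (2 * np.pi * d**2)
g = 1 / (2 * d) - 1 / (2 * np.pi * d) + m / (4 * d**2) - m / (2 * np.pi * d**2)
K_t = a * G * G + b * d * G + g * d**2 * I
lam1_reglu = np.linalg.eigvalsh(K_t)[-1]
lower = (m / (4 * d) + 0.5) * n + m / 2 - m / (2 * np.pi) + d - d / (2 * np.pi)
upper = (m / (4 * d) + m / (2 * np.pi * d) + 0.5 + 1 / (2 * np.pi)) * n + m / 2 + d

print(f"ReLU : numeric {lam1_relu:.1f}, predicted {lam1_relu_pred:.1f}")
print(f"ReGLU: numeric {lam1_reglu:.1f}, stated range [{lower:.1f}, {upper:.1f}]")
```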

  10. [10]

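    For the ReGLU model, similarly, we have $\tilde{K} \approx \tilde{\alpha}\,(XX^\top) \odot (XX^\top) + \tilde{\beta} d\,XX^\top + \tilde{\gamma} d^2 I$

    ReGLU model. For the ReGLU model, similarly, we have $\tilde{K} \approx \tilde{\alpha}\,(XX^\top) \odot (XX^\top) + \tilde{\beta} d\,XX^\top + \tilde{\gamma} d^2 I$. We know that $XX^\top$ is positive semi-definite. Therefore, by the Schur product theorem (Horn & Johnson, 2012, Theorem 7.5.3), $(XX^\top) \odot (XX^\top)$ is also positive semi-definite. Hence, by Weyl's inequality (Thm. B.3), we have $\max\left\{\lambda_1\!\left(\tilde{\alpha}\,(XX^\top) \odot (XX^\top)\right),\ \lambda_1\!\left(\tilde{\beta} d\,XX^\top\right)\right\} \le \lambda_1(\tilde{K}) - \tilde{\gamma} d^2 \le \lambda_1(\ldots$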
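
The two linear-algebra facts used in this step can be checked on a random example. The sketch below verifies that the Hadamard square of a Gram matrix is positive semi-definite (Schur product theorem) and that the top eigenvalue of a sum of two positive semi-definite matrices lies between the larger individual top eigenvalue and their sum. The right-hand side of the excerpt is truncated, so the sum bound is assumed here as the standard form of Weyl's inequality; the weights `0.3` and `0.7` are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 10                                 # hypothetical sizes
X = rng.standard_normal((n, d))
G = X @ X.T                                   # positive semi-definite Gram matrix
H = G * G                                     # Hadamard (entrywise) square

def top(M):
    """Largest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(M)[-1]

A = 0.3 * H                                   # arbitrary positive weights, illustration only
B = 0.7 * G

min_eig_H = np.linalg.eigvalsh(H)[0]          # Schur product theorem: should be >= 0
lower = max(top(A), top(B))                   # Weyl lower bound for PSD summands
mid = top(A + B)
upper = top(A) + top(B)                       # Weyl upper bound

print("min eigenvalue of Hadamard square:", min_eig_H)
print(f"{lower:.3f} <= {mid:.3f} <= {upper:.3f}")
```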

  11. [11]

    For the ReLU model, the smallest eigenvalue of its NTK matrix is given by $\lambda_n(K) \approx \frac{m + d}{4}(s^2 + 1) - \frac{m}{2\pi}$, where $s := \max\left\{0,\ 1 - \sqrt{n/d}\right\}$

  12. [12]

    Therefore, both $\lambda_n(K)$ and $\lambda_n(\tilde{K})$ are of order $\Theta(m + d)$

    For the ReGLU model, the smallest eigenvalue of its NTK matrix is given by $\lambda_n(\tilde{K}) \approx \frac{m + 2d}{4}(\tilde{s}^2 + 1) + \frac{m + d}{2\pi}(s^2 - 1)$, where $\tilde{s} = \max\left\{0,\ 1 - \sqrt{2n}/d\right\}$. Therefore, both $\lambda_n(K)$ and $\lambda_n(\tilde{K})$ are of order $\Theta(m + d)$. When $n > d$, we have that $\lambda_n(\tilde{K}) > \lambda_n(K)$. Proof. 1) ReLU model. For the ReLU model, we recall that its NTK matrix can be written...

  13. [13]

    For the ReGLU model, recall that $\tilde{K} = \tilde{\alpha}\,(XX^\top) \odot (XX^\top) + \tilde{\beta}\,(rr^\top) \odot (XX^\top) + \tilde{\gamma} D^2$ (17). First, we notice that $XX^\top$ is positive semidefinite

    ReGLU model. For the ReGLU model, recall that $\tilde{K} = \tilde{\alpha}\,(XX^\top) \odot (XX^\top) + \tilde{\beta}\,(rr^\top) \odot (XX^\top) + \tilde{\gamma} D^2$ (17). First, we notice that $XX^\top$ is positive semidefinite. To see this, consider the Rayleigh quotient: $\lambda_n(XX^\top) = \min_{\|v\| = 1} v^\top XX^\top v = \min_{\|v\| = 1} \|X^\top v\|^2 \ge 0$. Similarly, $rr^\top$ is also positive semidefinite. By Thm. B.7, $rr^\top \odot XX^\top$ is also positive semid...

  14. [14]

    lazy training

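    Comparing two eigenvalues. Finally, we compare the two smallest eigenvalues. Subtracting one from the other, we have $\lambda_n(\tilde{K}) - \lambda_n(K) \approx \frac{m + 2d}{4}\tilde{s}^2 + \left(\frac{1}{4} - \frac{1}{2\pi}\right)\left[d - (m + d)s^2\right]$. Because $n > d$, $s = 0$. Therefore, $\lambda_n(\tilde{K}) - \lambda_n(K) \approx \frac{m + 2d}{4}\tilde{s}^2 + \left(\frac{1}{4} - \frac{1}{2\pi}\right)d > 0$. C. Proof of Loss Crossing Results. C.1. Proof of Prop. 4.1. Before proving Prop. 4.1, we first prove the following...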
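
The subtraction in this step is plain algebra and can be reproduced with a few lines of arithmetic. The sketch below evaluates the two smallest-eigenvalue approximations and the stated difference for a few hypothetical `(m, d, n)` triples. One reading is assumed because the extracted text is ambiguous: the ReGLU shrinkage factor is taken as `max(0, 1 - sqrt(2n)/d)`.

```python
import math

def s_relu(n, d):
    return max(0.0, 1.0 - math.sqrt(n / d))

def s_reglu(n, d):
    # Assumed reading of the garbled source: 1 - sqrt(2n) / d.
    return max(0.0, 1.0 - math.sqrt(2 * n) / d)

def lam_min_relu(m, d, n):
    s = s_relu(n, d)
    return (m + d) / 4 * (s**2 + 1) - m / (2 * math.pi)

def lam_min_reglu(m, d, n):
    s, st = s_relu(n, d), s_reglu(n, d)
    return (m + 2 * d) / 4 * (st**2 + 1) + (m + d) / (2 * math.pi) * (s**2 - 1)

def gap(m, d, n):
    """Stated difference lam_min_reglu - lam_min_relu."""
    s, st = s_relu(n, d), s_reglu(n, d)
    return (m + 2 * d) / 4 * st**2 + (0.25 - 1 / (2 * math.pi)) * (d - (m + d) * s**2)

cases = [(1000, 50, 200), (500, 100, 30), (64, 16, 400)]   # hypothetical (m, d, n)
for m, d, n in cases:
    direct = lam_min_reglu(m, d, n) - lam_min_relu(m, d, n)
    print(f"m={m}, d={d}, n={n}: direct {direct:.4f}, formula {gap(m, d, n):.4f}")
```

For the two triples with `n > d` the difference is positive, matching the claim; the middle triple has `n < d` and is included only to exercise the general expression.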

  15. [15]

    That is, ReLU model converges faster than ReGLU model

    At the early stage when $\eta k$ is small, as long as $Y^\top (K - \tilde{K}) Y \ge 0$, $d \ge 5$ and $n \ge 300$, it holds that $\mathbb{E}_\theta[L_k - \tilde{L}_k] < 0$. That is, the ReLU model converges faster than the ReGLU model

  16. [16]

    That is, the ReGLU model takes over and converges faster than the ReLU model

    At the later stage when the minimum eigenvalue $\lambda_{\min}$ dominates the training process, for sufficiently large $k$, it holds that $\mathbb{E}_\theta[L_k - \tilde{L}_k] > 0$. That is, the ReGLU model takes over and converges faster than the ReLU model. Proof. 1) Early stage. We first consider the early stage when $\eta k$ is relatively small. Expanding the expression in Prop. 4.1, we obtain $\mathbb{E}_\theta[L_k] \propto \mathrm{Tr}[(I - \eta K)^{2\ldots}$

  17. [17]

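    To further analyze the dynamics at the later stage, we use eigendecomposition

    Later stage. To further analyze the dynamics at the later stage, we use eigendecomposition. Denote by $(\lambda_i, v_i)$ the eigenpairs of $K$ and let $\beta_i := Y^\top v_i$. For the ReLU model, we have that $\mathbb{E}_\theta[L_k] \propto \mathrm{Tr}[(I - \eta K)^{2k} K] + Y^\top (I - \eta K)^{2k} Y \propto \sum_{i=1}^{n} (\lambda_i + \beta_i^2)(1 - \eta\lambda_i)^{2k}$. At the later stage when $k \to \infty$, $(1 - \eta\lambda_n)^{2k}$ dominates the above expression. We have $\mathbb{E}_\theta[L_k] \propto (\lambda_n + \beta_n^2)(1 - \eta\lambda_n)^{2k} \sum_{i=1}^{n} \lambda$...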
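
The eigendecomposed loss expression makes the loss-crossing behavior easy to reproduce on a toy example. The sketch below evaluates `sum_i (lambda_i + beta_i^2) * (1 - eta * lambda_i)^(2k)` for two hand-picked three-eigenvalue spectra: a wide one (one large eigenvalue, two small) and a compact one. Both spectra, the `beta_i^2` values, and the step size are hypothetical and chosen only to illustrate the mechanism; they are not taken from the paper.

```python
def expected_loss(eigs, betas_sq, eta, k):
    """Evaluate sum_i (lambda_i + beta_i^2) * (1 - eta * lambda_i)^(2k)."""
    return sum((lam + b2) * (1 - eta * lam) ** (2 * k)
               for lam, b2 in zip(eigs, betas_sq))

spread = [9.0, 0.5, 0.5]      # hypothetical wide spectrum (large condition number)
compact = [4.0, 3.5, 3.0]     # hypothetical compact spectrum (small condition number)
betas_sq = [1.0, 1.0, 1.0]    # hypothetical projections of the targets
eta = 0.1                     # step size below 2 / max eigenvalue for both spectra

steps = range(0, 201)
loss_spread = [expected_loss(spread, betas_sq, eta, k) for k in steps]
loss_compact = [expected_loss(compact, betas_sq, eta, k) for k in steps]

# First step at which the compact-spectrum loss drops below the wide-spectrum loss.
crossing = next(k for k in steps if loss_compact[k] < loss_spread[k])
print("initial losses:", loss_spread[0], loss_compact[0])
print("first crossing step:", crossing)
```

In this toy setting the wide spectrum starts with the lower value, while the compact spectrum overtakes it after a couple of steps and stays ahead, because its smallest eigenvalue is larger and so its slowest mode decays faster.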