The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry
Pith reviewed 2026-05-13 20:53 UTC · model grok-4.3
The pith
Transformer weight spectra develop traveling compression waves and stable depth gradients that separate rank from shape and forecast layer importance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
During transformer pretraining, the singular-value spectra of weight matrices exhibit transient traveling compression waves that propagate from early to late layers before reversing, together with persistent non-monotonic depth gradients in the power-law exponent α. Query and key projections drive the full depth-dependent dynamics, while value and output projections compress uniformly. This establishes that rank and spectral shape encode fundamentally different information about training, and supports a two-timescale dynamical model with the scaling law Δα ∝ L^0.26.
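For concreteness, the two quantities the claim plays against each other can be written out. These are the standard definitions; the paper's exact α estimator is not specified in this review:

% Stable rank (the transient quantity), for a weight matrix W with
% singular values sigma_1 >= sigma_2 >= ...:
R_s(W) = \frac{\lVert W \rVert_F^2}{\lVert W \rVert_2^2} = \frac{\sum_i \sigma_i^2}{\sigma_1^2}
% Power-law spectral shape (the persistent quantity), from a fit of the
% ranked spectrum:
\sigma_i \propto i^{-\alpha}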
What carries the argument
The power-law exponent alpha of the singular-value spectrum, which encodes the persistent shape of each weight matrix across depth and time while rank tracks only the transient compression.
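A minimal sketch of one common way to estimate α, fitting the tail of the ranked spectrum in log-log space. The paper may instead fit the eigenvalue density of W^T W, as in the heavy-tailed self-regularization literature, and tail_fraction is a hypothetical knob:

import numpy as np

def powerlaw_alpha(W, tail_fraction=0.5):
    """Estimate alpha assuming sigma_i ~ i^(-alpha) over the spectrum's tail."""
    sigma = np.linalg.svd(W, compute_uv=False)   # singular values, descending
    n = sigma.size
    start = int(n * (1.0 - tail_fraction))       # fit only the trailing portion
    ranks = np.arange(1, n + 1)[start:].astype(float)
    vals = sigma[start:]
    keep = vals > 0                              # guard the logs below
    slope, _ = np.polyfit(np.log(ranks[keep]), np.log(vals[keep]), 1)
    return -slope                                # alpha is the negated log-log slope

Fitting only the tail keeps the flat head of the spectrum from dominating the regression; where the cutoff belongs is exactly the kind of estimator detail the review cannot recover from the abstract.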
If this is right
- Alpha develops a permanent depth gradient that forms a non-monotonic inverted-U in deeper models.
- Changes in alpha scale with model depth as Δα ∝ L^0.26 (R² = 0.99); a fitting sketch follows this list.
- Alpha values predict layer importance with Spearman correlations of 0.69 to 0.84.
- Spectral-guided pruning outperforms Last-N position heuristics by factors of 1.1x to 3.6x across seven models.
- Performance gaps between worst and best layer choices reach up to 23.7x, confirming the causal role of spectral structure.
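The rebuttal below confirms that the 0.26 exponent comes from least-squares fitting rather than a closed-form derivation. A sketch of that procedure on hypothetical stand-in data (the real per-model Δα values are not reproduced here):

import numpy as np

# Hypothetical (depth, delta_alpha) pairs standing in for the nine models.
L = np.array([8.0, 12, 12, 16, 24, 24, 32, 36, 36])
delta_alpha = np.array([0.31, 0.35, 0.34, 0.38, 0.42, 0.43, 0.46, 0.47, 0.48])

# Fit delta_alpha = c * L**beta as a line in log-log space.
beta, log_c = np.polyfit(np.log(L), np.log(delta_alpha), 1)

# In-sample R^2 of the log-log fit: the statistic the circularity check
# below flags as describing fit quality, not prediction.
resid = np.log(delta_alpha) - (log_c + beta * np.log(L))
r2 = 1.0 - resid.var() / np.log(delta_alpha).var()
print(f"beta = {beta:.2f}, in-sample R^2 = {r2:.2f}")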
Where Pith is reading between the lines
- The two-timescale separation implies that early training primarily reduces rank while later stages stabilize spectral distributions that may govern information flow between layers.
- The Q/K-V asymmetry suggests attention mechanisms functionally separate query-key matching dynamics from value projection, a split that could be tested in non-transformer sequence models.
- Monitoring alpha during training could provide an early diagnostic of layer health that operates independently of loss or accuracy curves (a minimal monitoring sketch follows this list).
- The scaling law for Delta alpha offers a way to predict spectral gradients in larger models without running full training trajectories.
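A minimal sketch of what such monitoring could look like at a checkpoint. The per-matrix enumeration and the diagnostics chosen here (stable rank, alpha, top-two spectral gap) are assumptions, not the paper's published procedure:

import numpy as np

def stable_rank(sigma):
    """Stable rank from singular values: ||W||_F^2 / ||W||_2^2."""
    return float((sigma ** 2).sum() / sigma[0] ** 2)

def spectral_snapshot(named_weights, alpha_fn):
    """Diagnostics for one checkpoint.

    named_weights: iterable of (name, 2-D array), e.g. each layer's
    Q/K/V/O and MLP matrices (how to enumerate them is framework-specific).
    alpha_fn: an alpha estimator such as powerlaw_alpha above.
    """
    log = {}
    for name, W in named_weights:
        sigma = np.linalg.svd(W, compute_uv=False)
        log[name] = {
            "stable_rank": stable_rank(sigma),
            "alpha": alpha_fn(W),
            "top_gap": float(sigma[0] / sigma[1]),
        }
    return log

Called every fixed number of optimizer steps, this yields per-layer trajectories of rank (transient) and alpha (persistent) without consulting the loss curve.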
Load-bearing premise
The observed compression waves and spectral gradients are intrinsic to transformer training dynamics rather than artifacts of the specific optimizer, data mixture, or initialization used in the nine evaluated models.
What would settle it
Training an identical transformer architecture with plain SGD instead of Adam on a non-language dataset and checking whether the same transient waves and alpha depth gradients appear would test the claim; their absence would falsify the intrinsic-dynamics interpretation.
Original abstract
We present the first systematic study of weight matrix singular value spectra \emph{during} transformer pretraining, tracking full SVD decompositions of every weight matrix at 25-step intervals across three model scales (30M--285M parameters). We discover three phenomena: \textbf{(1)~Transient Compression Waves:} stable rank compression propagates as a traveling wave from early to late layers, creating a dramatic gradient that peaks early then \emph{reverses} -- late layers eventually over-compress past early layers. \textbf{(2)~Persistent Spectral Gradients:} the power-law exponent~$\alpha$ develops a permanent depth gradient forming a non-monotonic inverted-U in deeper models, with peaks shifting toward earlier layers as depth increases. \textbf{(3)~Q/K--V Functional Asymmetry:} value/output projections compress uniformly while query/key projections carry the full depth-dependent dynamics. The dissociation between transient compression and persistent spectral shape reveals that \emph{rank and spectral shape encode fundamentally different information about training}. We formalize this as a two-timescale dynamical model and derive scaling laws ($\Delta\alpha \propto L^{0.26}$, $R^2{=}0.99$). We validate on nine models across three families (custom, GPT-2, Pythia; 30M--1B parameters; 8--36 layers), demonstrate that $\alpha$ predicts layer importance ($\rho{=}0.69$--$0.84$, $p{<}0.02$), and show that spectral-guided pruning outperforms Last-N heuristics by $1.1{\times}$--$3.6{\times}$ across seven models in two families (GPT-2 124M--774M, Pythia 160M--1B), with worst-vs-best gaps up to $23.7{\times}$ confirming the causal role of spectral structure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a systematic analysis of singular value spectra of weight matrices during transformer pretraining across multiple scales and families. It reports three main phenomena: transient compression waves that propagate as traveling waves through layers and eventually reverse; persistent spectral gradients where the power-law exponent alpha forms a non-monotonic inverted-U depth dependence; and Q/K-V asymmetry where value/output projections compress uniformly while query/key carry the dynamics. The work formalizes a two-timescale model, derives scaling laws such as Δα ∝ L^0.26 (R²=0.99), shows alpha correlates with layer importance (ρ=0.69-0.84), and demonstrates spectral-guided pruning outperforming Last-N heuristics by 1.1x-3.6x.
Significance. If the reported dynamics prove intrinsic rather than setup-specific, the dissociation between transient rank compression and persistent spectral shape would provide a new framework for understanding transformer training, with the scaling law and pruning results offering both theoretical and practical value. The multi-model validation and high reported correlations strengthen the empirical case for alpha as a distinct predictor of layer utility.
major comments (2)
- [Abstract and scaling-law section] The claim that Δα ∝ L^0.26 is 'derived' from the two-timescale model is not supported by explicit derivation steps or parameter-free reasoning; the exponent appears to be obtained by direct fitting to the observed depth dependence across the nine models, raising questions about whether the R²=0.99 reflects predictive power or post-hoc adjustment (a leave-one-out check of this distinction is sketched after these comments).
- [Methods and validation sections] All nine models (custom, GPT-2, Pythia; 30M-1B) share AdamW, comparable web-scale data mixtures, and standard initializations. No ablations vary optimizer, data distribution, or initialization, so the traveling-wave compression, inverted-U alpha gradient, Q/K-V asymmetry, and derived scaling law could be artifacts of the shared setup rather than intrinsic two-timescale behavior; this directly affects the generality of the central claim.
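One cheap way to separate fit quality from predictive power, in the spirit of the first major comment: refit the exponent with each model held out and score the held-out residuals. A sketch, where L and delta_alpha are arrays like the hypothetical ones above:

import numpy as np

def loo_log_rmse(L, delta_alpha):
    """Leave-one-out check of the fit delta_alpha = c * L**beta.

    Refits (beta, c) with each model held out and returns the RMSE of
    held-out predictions in log space; an out-of-sample counterpart to
    the reported in-sample R^2.
    """
    x, y = np.log(L), np.log(delta_alpha)
    errs = []
    for i in range(x.size):
        mask = np.arange(x.size) != i
        beta, log_c = np.polyfit(x[mask], y[mask], 1)
        errs.append(y[i] - (log_c + beta * x[i]))
    return float(np.sqrt(np.mean(np.square(errs))))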
minor comments (2)
- [Abstract] No error bars, confidence intervals, or data-exclusion criteria are reported for the R² values, correlations, or pruning gains, making it difficult to assess the statistical robustness of the claimed 0.26 exponent and 1.1x-3.6x improvements.
- [Pruning experiments] The spectral-guided method is validated on held-out models but still uses the alpha statistic whose depth dependence was characterized on the same training runs; additional controls (e.g., random or rank-based baselines matched for compute) would clarify whether the gains are truly due to spectral structure (such baselines are sketched after this list).
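A sketch of the comparison the second minor comment asks for. All arrays here are hypothetical, and whether low or high alpha flags a prunable layer is an assumption rather than the paper's stated rule:

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_layers, k = 24, 6                          # hypothetical depth and prune budget

layer_alpha = rng.uniform(2.0, 5.0, n_layers)                   # stand-in alpha per layer
layer_importance = layer_alpha + rng.normal(0, 0.5, n_layers)   # stand-in importance

# Rank correlation of the kind the paper reports (rho = 0.69-0.84).
rho, p = spearmanr(layer_alpha, layer_importance)
print(f"spearman rho = {rho:.2f} (p = {p:.3f})")

spectral_set = np.argsort(layer_alpha)[:k]                # spectral-guided choice
last_n_set = np.arange(n_layers - k, n_layers)            # Last-N heuristic
random_set = rng.choice(n_layers, size=k, replace=False)  # compute-matched random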
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The two major comments raise important issues about the precise status of our scaling law and the generality of the reported phenomena. We address each point below, have revised the manuscript where feasible, and note one standing limitation.
point-by-point responses
- Referee: The claim that Δα ∝ L^0.26 is 'derived' from the two-timescale model is not supported by explicit derivation steps or parameter-free reasoning; the exponent appears to be obtained by direct fitting to the observed depth dependence across the nine models, raising questions about whether the R²=0.99 reflects predictive power or post-hoc adjustment.
Authors: We agree that the specific exponent 0.26 is obtained by least-squares fitting to the measured Δα values across the nine models rather than by a closed-form derivation from the two-timescale equations. The model supplies the functional form (power-law dependence on depth) and the expectation of a positive exponent, but does not predict the numerical coefficient without additional assumptions. The reported R²=0.99 quantifies the quality of the fit to the observed data. In the revised manuscript we have changed the wording in the abstract and scaling-law section from 'derive scaling laws' to 'empirically measure scaling laws consistent with the two-timescale model', added the explicit fitting procedure and model equations to the methods appendix, and clarified that the high R² describes in-sample fit rather than out-of-sample prediction. revision: yes
- Referee: All nine models share AdamW, comparable web-scale data mixtures, and standard initializations. No ablations vary optimizer, data distribution, or initialization, so the traveling-wave compression, inverted-U alpha gradient, Q/K-V asymmetry, and derived scaling law could be artifacts of the shared setup rather than intrinsic two-timescale behavior; this directly affects the generality of the central claim.
Authors: This is a valid concern. While the three phenomena appear consistently across custom, GPT-2, and Pythia families spanning 30M–1B parameters, the shared optimizer, data mixture, and initialization leave open the possibility of setup-specific effects. We have added a new limitations paragraph in the discussion section that explicitly acknowledges the absence of optimizer/data ablations and outlines the computational cost of such experiments as future work. The multi-family replication still provides supporting evidence, but does not substitute for controlled ablations. revision: partial
- Standing limitation: Full demonstration that the traveling-wave compression, inverted-U spectral gradient, and Q/K-V asymmetry are independent of optimizer choice, data distribution, and initialization would require new large-scale ablations that are outside the scope of the current revision.
Circularity Check
Scaling law presented as 'derived' reduces to direct empirical fit on observed depth dependence
specific steps
- Fitted input called prediction
[Abstract (scaling laws paragraph) and two-timescale model section]
"We formalize this as a two-timescale dynamical model and derive scaling laws (Δα ∝ L^{0.26}, R²=0.99)."
The specific exponent 0.26 and near-perfect R²=0.99 are obtained by fitting a power law directly to the measured Δα versus depth L across the nine models; the 'derivation' therefore reproduces the input observations by construction rather than predicting them from independent first principles of the dynamical model.
full rationale
The central derivation chain claims a two-timescale dynamical model yields scaling laws, but the quoted law is a post-hoc power-law fit to the same depth-dependent alpha observations used to motivate the model. This matches the fitted-input-called-prediction pattern exactly. Pruning superiority is validated on held-out models yet still depends on the alpha statistic whose functional form was tuned on the primary training runs, creating partial circularity in the causal claim. No self-citation load-bearing steps or ansatz smuggling are present; the remainder of the spectral observations and Q/K-V asymmetry are direct measurements without reduction to inputs. The overall score reflects one load-bearing reduction rather than wholesale circularity.
Axiom & Free-Parameter Ledger
free parameters (1)
- scaling_exponent_0.26
axioms (1)
- Domain assumption: Singular value spectra of weight matrices remain well-defined and comparable across training checkpoints and model scales.