arxiv: 2606.05957 · v1 · pith:WO34R44Dnew · submitted 2026-06-04 · 💻 cs.LG · stat.ML

Dead Directions: Geometric Singular Learning

Pith reviewed 2026-06-28 03:16 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords dead directions · singular learning theory · Fisher metric · real log canonical threshold · KL order · overparameterized models · information geometry · K-FAC

The pith

The KL order of a dead direction equals the decay rate of its directional Fisher curvature approaching the singularity in original coordinates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that singular learning invariants become accessible in the original parameter coordinates of overparameterized models by identifying dead directions, which are the unit vectors where the Fisher metric degenerates along the analytic singular set. These directions carry a definite KL order determined by the rate at which the KL divergence vanishes, and that order is recovered directly from how the directional Fisher curvature decays near the singularity. The recovery works without blowing up the singularity and extends through a selection rule on smooth fibres to Watanabe's real log canonical threshold contribution, plus further cases like crossings, multiplicity, and tempered posteriors. In deep networks the same rates factor through K-FAC blocks with activation-gradient duality, and a quotient theorem lifts them to the gauge quotient under invariant gradient flow.

Core claim

A dead direction is a unit vector along which the Fisher metric degenerates, equivalently a tangent to the analytic singular set carrying a definite KL order set by the vanishing rate of the KL divergence. Its KL order is recoverable as the decay rate of the directional Fisher curvature approaching the singularity, in original parameter coordinates and without a Hironaka resolution. A selection rule on smooth fibres translates the recovered rate into the single-direction contribution to the real log canonical threshold, and the recovery extends to multi-component crossings, multiplicity m, the singular fluctuation ν, prior-RLCT shifts, and tempered posteriors. The same rate lifts to deep net…

What carries the argument

The dead direction: a unit vector tangent to the analytic singular set along which the Fisher metric degenerates with a definite KL order, allowing recovery of that order from curvature decay without resolution.

If this is right

  • The recovery extends to multi-component crossings, multiplicity m, and the singular fluctuation ν, which is universal in the KL order for one-dimensional directions.
  • Prior-RLCT shifts and tempered posteriors are covered by the same rate extraction.
  • In deep networks each Fisher block factors as a product of activation-side and gradient-side rates with duality between them.
  • A quotient theorem carries the rate to the gauge quotient Θ/G under gradient flow on a G-invariant metric.
  • SGD qualifies for the quotient while standard Adam does not, and a G-equivariant Adam-family preconditioner restores the property.
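
The block-factorisation bullet above can be illustrated with the standard K-FAC Kronecker identity, $(a \otimes g)^\top (A \otimes G)(a \otimes g) = (a^\top A a)(g^\top G g)$, under which decay exponents of the two factors add. The sketch below is only an illustration under stated assumptions: the factors `activation_factor` and `gradient_factor`, their exponents 2 and 4, and the ordering $A \otimes G$ are hypothetical choices made here, not the paper's multi-layer construction or its duality statement.

```python
import math

def kron(A, G):
    """Kronecker product of two square matrices given as nested lists."""
    n, m = len(A), len(G)
    return [[A[i // m][j // m] * G[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def kron_vec(a, g):
    """Kronecker product of two vectors, matching the ordering used in kron."""
    return [x * y for x in a for y in g]

def quad(M, u):
    """Quadratic form u^T M u."""
    return sum(u[i] * M[i][j] * u[j] for i in range(len(u)) for j in range(len(u)))

def activation_factor(t):
    # Hypothetical activation-side second moment, degenerating like t^2 along e1.
    return [[t ** 2, 0.0], [0.0, 1.0]]

def gradient_factor(t):
    # Hypothetical gradient-side second moment, degenerating like t^4 along e1.
    return [[t ** 4, 0.0], [0.0, 1.0]]

def slope(f, t_hi=1e-1, t_lo=1e-3):
    """Two-point log-log slope of f(t) between t_lo and t_hi."""
    return (math.log(f(t_hi)) - math.log(f(t_lo))) / (math.log(t_hi) - math.log(t_lo))

a = [1.0, 0.0]          # degenerate direction on the activation side
g = [1.0, 0.0]          # degenerate direction on the gradient side
u = kron_vec(a, g)      # corresponding direction in the full block

act_rate = slope(lambda t: quad(activation_factor(t), a))
grad_rate = slope(lambda t: quad(gradient_factor(t), g))
block_rate = slope(lambda t: quad(kron(activation_factor(t), gradient_factor(t)), u))

# The Kronecker identity itself, checked on generic (non-degenerate) vectors.
a2, g2 = [0.3, -1.2], [2.0, 0.7]
A, G = activation_factor(0.5), gradient_factor(0.5)
lhs = quad(kron(A, G), kron_vec(a2, g2))
rhs = quad(A, a2) * quad(G, g2)

print(act_rate, grad_rate, block_rate)
```

In this toy the block exponent is the sum of the two factor exponents, which is one way to read "a product of activation-side and gradient-side rates"; whether the paper's rates combine in exactly this way is not confirmed by this page.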

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single-checkpoint readout could let practitioners track changes in singular geometry across training epochs using only existing forward and backward passes.
  • The activation-gradient duality might identify which layers set the dominant singular contribution in a given architecture.
  • Gauge-equivariant preconditioners could be tested for whether they preserve or alter the observed decay rates during optimization.
  • The coordinate-based method might be applied to other degeneracies in loss landscapes outside neural networks to obtain analogous learning invariants.

Load-bearing premise

A selection rule on smooth fibres translates the recovered curvature decay rate into Watanabe's single-direction contribution to the real log canonical threshold.
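
For orientation, the target quantity of this premise can be computed independently on the toy landscape from the Figure 3 caption, $K = \theta_1^4 + \theta_2^2$, using the standard volume-scaling characterisation $\mathrm{vol}\{K < \varepsilon\} \sim \varepsilon^{\lambda}$ (no logarithmic factor when the multiplicity is 1). This is a minimal sketch and not the paper's selection rule: it assumes a flat prior whose support contains the sublevel set, and the helper names are hypothetical. For this landscape the known value is $\lambda = 1/4 + 1/2 = 3/4$.

```python
import math

def sublevel_area(eps, n=20000):
    """Area of {theta1^4 + theta2^2 < eps} by midpoint quadrature.

    For each theta1 with theta1^4 < eps, the theta2-slice has
    length 2 * sqrt(eps - theta1^4).
    """
    half_width = eps ** 0.25
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        t1 = -half_width + (i + 0.5) * h
        total += 2.0 * math.sqrt(max(eps - t1 ** 4, 0.0)) * h
    return total

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((p - mx) * (q - my) for p, q in zip(lx, ly))
    den = sum((p - mx) ** 2 for p in lx)
    return num / den

eps_values = [10.0 ** (-e) for e in range(2, 7)]
areas = [sublevel_area(e) for e in eps_values]
lam = loglog_slope(eps_values, areas)   # estimate of the volume exponent
print("estimated volume exponent:", round(lam, 4))
```

A rate recovered from curvature decay would have to reproduce this exponent, direction by direction, for the premise to hold on this example.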

What would settle it

In a concrete singular model such as reduced-rank regression or a two-layer linear network with known degeneracy, compute the directional Fisher curvature decay along the candidate dead direction and check whether the resulting rate equals the independently resolved KL order.
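
A minimal sketch of such a check, on a model simpler than those named above. It assumes a hypothetical regression $f(x;\theta) = \theta_1^2 x_1 + \theta_2 x_2$ with unit Gaussian noise, independent unit-variance inputs, and true parameter $\theta_0 = 0$. That model is chosen here for illustration (it is not taken from the paper) because its KL divergence is $(\theta_1^4 + \theta_2^2)/2$, the Figure 3 landscape up to a constant factor, and its Fisher matrix is available in closed form.

```python
import math

def kl(theta):
    """KL divergence to the truth theta0 = 0 for the assumed toy model."""
    t1, t2 = theta
    return 0.5 * (t1 ** 4 + t2 ** 2)

def fisher(theta):
    """Fisher matrix E[grad f grad f^T] for f = t1^2 * x1 + t2 * x2.

    With independent unit-variance inputs this is diag(4 * t1^2, 1).
    """
    t1, _ = theta
    return [[4.0 * t1 ** 2, 0.0], [0.0, 1.0]]

def quad_form(F, u):
    """Directional Fisher curvature u^T F u."""
    return sum(u[i] * F[i][j] * u[j] for i in range(2) for j in range(2))

def loglog_slope(ts, ys):
    """Least-squares slope of log(y) against log(t)."""
    xs = [math.log(t) for t in ts]
    ls = [math.log(y) for y in ys]
    mx = sum(xs) / len(xs)
    ml = sum(ls) / len(ls)
    num = sum((x - mx) * (l - ml) for x, l in zip(xs, ls))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def rates_along(u, ts):
    """Log-log slopes of K and of u^T F u along theta = t * u as t -> 0."""
    points = [(t * u[0], t * u[1]) for t in ts]
    k_slope = loglog_slope(ts, [kl(p) for p in points])
    f_slope = loglog_slope(ts, [quad_form(fisher(p), u) for p in points])
    return k_slope, f_slope

ts = [10.0 ** (-e / 4.0) for e in range(4, 13)]   # t from 1e-1 down to 1e-3
dead = rates_along((1.0, 0.0), ts)      # candidate dead direction theta1
regular = rates_along((0.0, 1.0), ts)   # transversal direction theta2
print("dead:    K slope %.3f, Fisher slope %.3f" % dead)
print("regular: K slope %.3f, Fisher slope %.3f" % regular)
```

Along the dead coordinate the fitted slopes are 4 for $K$ and 2 for the Fisher quadratic form, and along the transversal they are 2 and 0, matching the Figure 3 caption ($K \sim t^4$, $u^\top F u \sim t^2$, KL order $k = 2$). The settling experiment described above would repeat this comparison on a model whose KL order is known from an independent resolution.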

Figures

Figures reproduced from arXiv: 2606.05957 by Tejas Pradeep Shirodkar.

Figure 1: The dead direction bridges the two traditions: the same unit vector is Amari’s …
Figure 2: Three views of the dead-directions framework. (a) The rate primitive: along a dead …
Figure 3: What a dead direction is (Definition 1). (a) The KL divergence $K = \theta_1^4 + \theta_2^2$ as a landscape near a singular minimum $\theta_0 \in \Sigma_T$. The valley floor along the dead coordinate $\theta_1$ is super-flat ($K \sim t^4$, KL order $k = 2$), so the Fisher quadratic form decays, $u^\top F u \sim t^2 \to 0$; the transversal $\theta_2$ is a regular direction with $K \sim t^2$ and $u^\top F u = \Theta(1)$. (b) The same landscape from above: the level sets stretch along …
Figure 4: Selection rule on a smooth singular fiber (Theorem …
Figure 5: The Fisher–curvature–volume rate chain: three measurable faces of the single KL …
Figure 6: Composition additivity (Theorem 30). The two regimes the theorem distinguishes. (a) MLP chains at $N \in \{6, 8, 12\}$, $d = 16$: per-component measured slopes lie on the $y = x$ diagonal with the predicted sum $\sum_j k^{\mathrm{bk}}_j$, validating clean additivity across slopes 0–34. The scalar-transfer hypothesis holds and block rates add. (b) Attention chains at $N \in \{4, 6\}$: $W_O$ saturates at $\alpha = 8$ for $k > k^\star = 2$, deviating from …
Figure 7: Residual-DAG $\sigma_{\min}$ depth-invariance (Corollary 58). The mechanism the corollary identifies. (a) In a residual block $X_{i+1} = X_i + f_{i+1}(X_i)$, the additive identity skip provides a forward-$K$-distance-zero route from $X_0$ to every node, so the dead-direction component cannot decay below $X_0$ at leading order: $\sigma_{\min}(X_\ell)/\sigma_{\min}(X_0) \ge 1$ at every depth. (b) The depth profile contrast: a feedforward chain (no skips) decays a…
Figure 8: Refined attention-chain composition rates (Proposition …
Figure 9: Architectural freeze-probe roundup for the per-primitive lemmas of this section and …
Figure 10: Singular fluctuation along a 1D dead direction (Theorem …
Figure 11: SwiGLU forward rate $k^{\mathrm{fwd}}_{\mathrm{SwiGLU}} = 3$ (Proposition 114). Why SwiGLU has a higher block rate than the standard fc1-act-fc2 MLP. (a) The block cascade compounds three $t$-factors along the dead dimension, one each from $W_{\mathrm{gate}}$, $W_{\mathrm{up}}$, and $W_{\mathrm{down}}$, with silu supplying the $\sigma(0) = 1/2$ coefficient; together they give forward block rate 3. (b) The parametric prediction at canonical init validates the rate: SwiGLU slope 3…
Figure 12: Rate validation with extended $t$ range up to 2.0 shows the graceful breakdown predicted by the theorem’s asymptotic character. Green shaded region: asymptotic regime $t \le 0.3$ where the theorem holds tightly. At larger $t$, subleading $O(t^{k+2})$ corrections cause the observed slope to drift from the prediction (“asym” fit vs “full” fit per panel). The clean match at small $t$ and the specific correction structure…
Figure 13: Per-seed view of the slope fits in Table …
Original abstract

Singular learning theory and information geometry have studied the same parameter spaces in mostly separate vocabularies: the former computes Bayesian invariants in resolved coordinates, the latter works in original coordinates under a non-degeneracy assumption that overparameterised models routinely violate. We bridge them through one primitive, the dead direction: a unit vector along which the Fisher metric degenerates, equivalently a tangent to the analytic singular set with a definite KL order, set by how fast the KL divergence vanishes. The two readings name the same vector; our central move shows its KL order is recoverable as the decay rate of the directional Fisher curvature approaching the singularity, in original parameter coordinates and without a Hironaka resolution. A selection rule on smooth fibres translates this rate into Watanabe's single-direction contribution to the real log canonical threshold, and we extend the recovery to multi-component crossings, multiplicity $m$, the singular fluctuation $\nu$ (universal in the KL order for 1D directions), prior-RLCT shifts, and tempered posteriors. We then lift this rate to a deep network: a multi-layer K-FAC factorisation writes each Fisher block as a product of activation- and gradient-side rates with a duality between them, instantiated at modern-network primitives (residual streams, layer normalisation, attention). A quotient theorem carries the rate to the gauge quotient $\Theta/G$ under gradient flow on a $G$-invariant metric; SGD qualifies, standard Adam does not, and we construct a $G$-equivariant Adam-family preconditioner (DDCAdam) that does. The bridge yields a parameter-coordinate handle on singular geometry, closed-form per-architecture predictions, and a trajectory-rate readout of Watanabe's triple $(\lambda, m, \nu)$ from one checkpoint's forward and backward passes, without posterior sampling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces 'dead directions' as unit vectors along which the Fisher metric degenerates (equivalently, tangents to the analytic singular set with definite KL order) to bridge singular learning theory and information geometry. Its central claim is that the KL order of such a direction is recoverable as the decay rate of directional Fisher curvature approaching the singularity, in original parameter coordinates and without Hironaka resolution. A selection rule on smooth fibres translates this rate into Watanabe's single-direction contribution to the real log canonical threshold; the recovery is extended to multi-component crossings, multiplicity m, singular fluctuation ν, prior-RLCT shifts, and tempered posteriors. The framework is lifted to deep networks via multi-layer K-FAC factorisation (with activation/gradient duality), instantiated on residuals, layer norm, and attention; a quotient theorem carries rates to the gauge quotient Θ/G under G-invariant gradient flow (SGD qualifies, standard Adam does not), yielding a G-equivariant preconditioner (DDCAdam) and a trajectory-rate readout of Watanabe's triple (λ, m, ν) from a single checkpoint's forward/backward passes.

Significance. If the central recovery and selection rule hold rigorously, the work supplies a parameter-coordinate method to extract singular learning invariants directly from model checkpoints and trajectories. This would enable closed-form, architecture-specific predictions for deep networks without posterior sampling or explicit resolution of singularities, constituting a substantive bridge between information geometry and singular learning theory with potential practical utility for understanding generalization in overparameterised models.

major comments (2)
  1. [Abstract, central move paragraph] The selection rule on smooth fibres is the explicit bridge that translates the recovered directional KL order (from Fisher curvature decay) into Watanabe's single-direction RLCT contribution. The manuscript must supply a precise definition of the rule together with a proof that, for multi-component crossings or higher multiplicity, the selected fibre's vanishing order matches the minimal pole order of the zeta function without case-by-case resolution data; otherwise the claimed independence from Hironaka resolution does not hold even when the curvature decay is correctly measured.
  2. [Quotient theorem section] The claim that SGD qualifies while standard Adam does not, and that the constructed DDCAdam is G-equivariant, is load-bearing for the trajectory-rate readout of (λ, m, ν). The manuscript should verify that the G-invariance of the metric is preserved under the preconditioner for the specific gauge groups arising in residual streams and attention, with an explicit check that the rate extraction remains unchanged under the quotient.
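
The equivariance question raised in comment 2 can be illustrated on a toy problem. The sketch below assumes a planar rotation symmetry, a rotation-invariant loss $L(\theta) = (\lVert\theta\rVert^2 - 1)^2$, and the first step of standard Adam from zero optimizer state (where the bias-corrected moments are $g$ and $g^2$). All of these are choices made here for illustration. The paper's gauge groups and its DDCAdam preconditioner are not reproduced. An update rule is equivariant if stepping from the rotated point gives the same result as rotating the stepped point.

```python
import math

def grad(theta):
    """Gradient of the rotation-invariant loss L = (|theta|^2 - 1)^2."""
    x, y = theta
    c = 4.0 * (x * x + y * y - 1.0)
    return (c * x, c * y)

def rotate(theta, phi):
    """Rotate a 2D point by angle phi (the assumed gauge action)."""
    x, y = theta
    return (math.cos(phi) * x - math.sin(phi) * y,
            math.sin(phi) * x + math.cos(phi) * y)

def sgd_step(theta, lr=0.1):
    """Plain gradient step under the Euclidean metric."""
    g = grad(theta)
    return (theta[0] - lr * g[0], theta[1] - lr * g[1])

def adam_first_step(theta, lr=0.1, eps=1e-8):
    """First Adam step from zero state: bias-corrected m = g and v = g^2,
    so each coordinate moves by roughly lr * sign(g)."""
    g = grad(theta)
    return tuple(p - lr * gi / (math.sqrt(gi * gi) + eps)
                 for p, gi in zip(theta, g))

def equivariance_gap(step, theta, phi):
    """Distance between step(rotate(theta)) and rotate(step(theta))."""
    a = step(rotate(theta, phi))
    b = rotate(step(theta), phi)
    return math.hypot(a[0] - b[0], a[1] - b[1])

theta0, phi = (2.0, 0.5), 0.7
sgd_gap = equivariance_gap(sgd_step, theta0, phi)
adam_gap = equivariance_gap(adam_first_step, theta0, phi)
print("SGD gap:  %.2e" % sgd_gap)
print("Adam gap: %.2e" % adam_gap)
```

The SGD gap is at rounding level while the Adam gap is of the order of the learning rate, because the coordinate-wise normalisation does not commute with rotations. This shows only the qualitative distinction the comment refers to, on an assumed symmetry group, and is not a check of the paper's theorem.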
minor comments (2)
  1. Notation for the singular fluctuation ν and its universality in the KL order for 1D directions should be introduced with a short self-contained definition before its use in the extensions to tempered posteriors.
  2. The K-FAC factorisation paragraph would benefit from an explicit equation showing how the product of activation-side and gradient-side rates yields the directional Fisher curvature decay.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive report. The two major comments identify places where additional explicitness would strengthen the central claims. We respond to each below and indicate the corresponding revisions.

  1. Referee: [Abstract, central move paragraph] The selection rule on smooth fibres is the explicit bridge that translates the recovered directional KL order (from Fisher curvature decay) into Watanabe's single-direction RLCT contribution. The manuscript must supply a precise definition of the rule together with a proof that, for multi-component crossings or higher multiplicity, the selected fibre's vanishing order matches the minimal pole order of the zeta function without case-by-case resolution data; otherwise the claimed independence from Hironaka resolution does not hold even when the curvature decay is correctly measured.

    Authors: The selection rule is stated in the central move paragraph and formalised in Section 3 as the fibre whose tangent direction realises the slowest directional Fisher curvature decay among the smooth components meeting at the singularity. Theorem 3.4 proves that this choice recovers the minimal pole order of the zeta function for arbitrary finite numbers of components and any multiplicity; the argument uses only the analytic continuation properties of the zeta function on the resolved space together with the fact that directional KL orders are resolution-independent quantities already recoverable from the original coordinates. No case-by-case resolution data enters the proof. To make the statement and its generality fully self-contained we will add a short dedicated subsection that restates the rule, quotes the relevant part of Theorem 3.4, and spells out the multi-component case. revision: partial

  2. Referee: [Quotient theorem section] The claim that SGD qualifies while standard Adam does not, and that the constructed DDCAdam is G-equivariant, is load-bearing for the trajectory-rate readout of (λ, m, ν). The manuscript should verify that the G-invariance of the metric is preserved under the preconditioner for the specific gauge groups arising in residual streams and attention, with an explicit check that the rate extraction remains unchanged under the quotient.

    Authors: The quotient theorem (Theorem 5.3) already shows that any G-invariant Riemannian metric descends to the quotient and that directional curvature decay rates are invariant under the quotient map. For the concrete groups appearing in residual streams (additive translations) and attention (permutation and scaling actions), the multi-layer K-FAC factorisation commutes with the group action by construction; consequently the DDCAdam preconditioner, being built from these blocks, remains G-equivariant. Rate extraction is therefore unchanged. We will append a short corollary that specialises the general theorem to these two gauge groups and records the explicit invariance of the extracted (λ, m, ν) triple. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation chain is self-contained

The paper defines a dead direction via equivalence between Fisher metric degeneration and a KL-order tangent to the singular set, then claims to recover that order from directional Fisher curvature decay in original coordinates. This is presented as a derived relation rather than a definitional identity or fitted input renamed as prediction. No load-bearing self-citation, uniqueness theorem imported from the same authors, or ansatz smuggled via prior work appears in the abstract or described central move. The selection rule on smooth fibres is introduced as a translation step to Watanabe's RLCT contribution without evidence that it reduces by construction to the paper's own inputs or prior self-referential results. Extensions to multi-component cases, network factorizations, and optimizers build outward from this without circular reduction. The work is therefore scored as self-contained against external singular learning theory benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review yields no explicit free parameters or standard axioms; the central move itself functions as an unverified domain assumption.

axioms (1)
  • domain assumption: the KL order of a dead direction equals the decay rate of the directional Fisher curvature near the singularity
    This is the central move stated in the abstract.
invented entities (1)
  • dead direction (no independent evidence)
    purpose: Unit vector along which Fisher metric degenerates with definite KL order
    New primitive introduced to equate the two fields' descriptions of the same vector.

pith-pipeline@v0.9.1-grok · 5849 in / 1321 out tokens · 44662 ms · 2026-06-28T03:16:35.237336+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dead-Direction Conditioners: Gauge-Equivariant Preconditioning for Deep Networks

    cs.LG · 2026-06 · unverdicted · novelty 7.0

    Dead-Direction Conditioners provide gauge-equivariant preconditioning by conditioning optimizer state on symmetry orbits, yielding improved resistance to over-training collapse and higher detection of dead directions ...

  2. Dead-Direction Signatures: A Cheap Spectral Reading of Singular Complexity

    cs.LG · 2026-06 · unverdicted · novelty 7.0

    Dead-Direction Signatures provide closed-form spectral readings of dead directions in network activations and gradients that track rank deficits at singular minima, offering a cheap directional alternative to SGLD-based LLC.

  3. Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale

    cs.LG · 2026-06 · unverdicted · novelty 7.0

    The normalized inverse-scale direction of LayerNorm's affine parameters is an exact algebraic kernel of the post-final-norm centred activation covariance for any input distribution in LayerNorm transformers.
