Dead Directions: Geometric Singular Learning

Tejas Pradeep Shirodkar

arxiv: 2606.05957 · v1 · pith:WO34R44Dnew · submitted 2026-06-04 · 💻 cs.LG · stat.ML

Pith reviewed 2026-06-28 03:16 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords dead directions · singular learning theory · Fisher metric · real log canonical threshold · KL order · overparameterized models · information geometry · K-FAC

The pith

The KL order of a dead direction equals the decay rate of its directional Fisher curvature approaching the singularity in original coordinates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that singular learning invariants become accessible in the original parameter coordinates of overparameterized models by identifying dead directions, which are the unit vectors where the Fisher metric degenerates along the analytic singular set. These directions carry a definite KL order determined by the rate at which the KL divergence vanishes, and that order is recovered directly from how the directional Fisher curvature decays near the singularity. The recovery works without blowing up the singularity and extends through a selection rule on smooth fibres to Watanabe's real log canonical threshold contribution, plus further cases like crossings, multiplicity, and tempered posteriors. In deep networks the same rates factor through K-FAC blocks with activation-gradient duality, and a quotient theorem lifts them to the gauge quotient under invariant gradient flow.

Core claim

A dead direction is a unit vector along which the Fisher metric degenerates, equivalently a tangent to the analytic singular set carrying a definite KL order set by the vanishing rate of the KL divergence. Its KL order is recoverable as the decay rate of the directional Fisher curvature approaching the singularity, in original parameter coordinates and without a Hironaka resolution. A selection rule on smooth fibres translates the recovered rate into the single-direction contribution to the real log canonical threshold, and the recovery extends to multi-component crossings, multiplicity m, the singular fluctuation ν, prior-RLCT shifts, and tempered posteriors. The same rate lifts to deep net

What carries the argument

The dead direction: a unit vector tangent to the analytic singular set along which the Fisher metric degenerates with a definite KL order, allowing recovery of that order from curvature decay without resolution.

If this is right

The recovery extends to multi-component crossings, multiplicity m, and the singular fluctuation ν, which is universal in the KL order for one-dimensional directions.
Prior-RLCT shifts and tempered posteriors are covered by the same rate extraction.
In deep networks each Fisher block factors as a product of activation-side and gradient-side rates with duality between them.
A quotient theorem carries the rate to the gauge quotient Θ/G under gradient flow on a G-invariant metric.
SGD qualifies for the quotient while standard Adam does not, and a G-equivariant Adam-family preconditioner restores the property.
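
The SGD-versus-Adam contrast can be illustrated on a toy problem. The sketch below is not the paper's construction: it assumes a hypothetical two-parameter loss that is invariant under planar rotations, takes G to be that rotation group (the paper's gauge groups may differ), and checks whether one optimizer step commutes with the group action. All function names are illustrative, and DDCAdam is not implemented because the source gives no formula for it.

```python
import math

def grad(theta):
    # Gradient of the rotation-invariant toy loss L = (|theta|^2 - 1)^2 / 4 (assumed).
    r2 = theta[0] ** 2 + theta[1] ** 2
    return [(r2 - 1.0) * theta[0], (r2 - 1.0) * theta[1]]

def rotate(theta, angle):
    # Action of a planar rotation, standing in for a group element g in G.
    c, s = math.cos(angle), math.sin(angle)
    return [c * theta[0] - s * theta[1], s * theta[0] + c * theta[1]]

def sgd_step(theta, lr=0.1):
    g = grad(theta)
    return [p - lr * gi for p, gi in zip(theta, g)]

def adam_first_step(theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # First Adam step from zero-initialised moments, with bias correction.
    g = grad(theta)
    m_hat = [(1 - beta1) * gi / (1 - beta1) for gi in g]
    v_hat = [(1 - beta2) * gi * gi / (1 - beta2) for gi in g]
    return [p - lr * m / (math.sqrt(v) + eps)
            for p, m, v in zip(theta, m_hat, v_hat)]

def equivariance_gap(step, theta, angle):
    # Distance between "rotate, then step" and "step, then rotate".
    a = step(rotate(theta, angle))
    b = rotate(step(theta), angle)
    return math.hypot(a[0] - b[0], a[1] - b[1])

theta0 = [2.0, 0.5]
angle = math.pi / 6
sgd_gap = equivariance_gap(sgd_step, theta0, angle)
adam_gap = equivariance_gap(adam_first_step, theta0, angle)
print(f"SGD gap:  {sgd_gap:.2e}")
print(f"Adam gap: {adam_gap:.2e}")
```

Under these assumptions the SGD gap sits at floating-point rounding level, while the Adam gap is of the order of the learning rate, because Adam's elementwise normalisation depends on the choice of coordinate axes.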

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The single-checkpoint readout could let practitioners track changes in singular geometry across training epochs using only existing forward and backward passes.
The activation-gradient duality might identify which layers set the dominant singular contribution in a given architecture.
Gauge-equivariant preconditioners could be tested for whether they preserve or alter the observed decay rates during optimization.
The coordinate-based method might be applied to other degeneracies in loss landscapes outside neural networks to obtain analogous learning invariants.

Load-bearing premise

A selection rule on smooth fibres translates the recovered curvature decay rate into Watanabe's single-direction contribution to the real log canonical threshold.
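
For orientation, the target of that translation can be computed independently on the toy landscape of Figure 3, $K = \theta_1^4 + \theta_2^2$. The sketch below uses the textbook volume-scaling characterisation of the RLCT, not the paper's selection rule: the area of the sublevel set $\{K < \varepsilon\}$ scales as $\varepsilon^{\lambda}$, and for a sum of independent even powers $\lambda = 1/4 + 1/2 = 3/4$. Reading the $\theta_1$ term as a contribution $1/(2k)$ with $k = 2$ is an assumption based on the figure's convention $K \sim t^{2k}$; the function names are hypothetical.

```python
import math

def sublevel_area(eps, n=20000):
    # Area of {theta1^4 + theta2^2 < eps} via the midpoint rule in theta1.
    half_width = eps ** 0.25
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        x = -half_width + (i + 0.5) * h
        # For fixed theta1 = x, theta2 ranges over an interval of length 2*sqrt(eps - x^4).
        total += 2.0 * math.sqrt(max(eps - x ** 4, 0.0)) * h
    return total

def loglog_slope(xs, ys):
    # Least-squares slope of log(y) against log(x).
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

eps_values = [1e-2, 1e-3, 1e-4, 1e-5]
areas = [sublevel_area(e) for e in eps_values]
lam = loglog_slope(eps_values, areas)
print(f"estimated volume exponent: {lam:.4f}")
```

A curvature-based readout of the dead direction's rate would need to reproduce the $1/4$ share of this exponent for the premise to hold in this example.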

What would settle it

In a concrete singular model such as reduced-rank regression or a two-layer linear network with known degeneracy, compute the directional Fisher curvature decay along the candidate dead direction and check whether the resulting rate equals the independently resolved KL order.
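
A minimal sketch of such a check, on a model simpler than those named above. It assumes a hypothetical Gaussian regression model $f_\theta(x) = \theta_1^2 x_1 + \theta_2 x_2$ with unit noise, independent inputs of unit second moment, and true function zero, which gives $K(\theta) = \tfrac{1}{2}(\theta_1^4 + \theta_2^2)$ (the Figure 3 landscape up to a constant) and Fisher matrix $\mathrm{diag}(4\theta_1^2, 1)$. The conversion from slope to KL order follows the figure's convention ($K \sim t^{2k}$, $u^\top F u \sim t^{2k-2}$) and is an assumption, not a formula quoted from the paper.

```python
import math

def kl(theta):
    # KL divergence to the true model for the assumed toy regression.
    return 0.5 * (theta[0] ** 4 + theta[1] ** 2)

def fisher(theta):
    # Fisher matrix E[grad f grad f^T] for the assumed toy regression.
    return [[4.0 * theta[0] ** 2, 0.0], [0.0, 1.0]]

def quad_form(F, u):
    return sum(u[i] * F[i][j] * u[j] for i in range(2) for j in range(2))

def loglog_slope(xs, ys):
    # Least-squares slope of log(y) against log(x).
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

def rates(u, ts):
    # Slopes of K and of u^T F u along the path theta(t) = t * u from the singular point 0.
    points = [[t * u[0], t * u[1]] for t in ts]
    kl_slope = loglog_slope(ts, [kl(p) for p in points])
    fisher_slope = loglog_slope(ts, [quad_form(fisher(p), u) for p in points])
    return kl_slope, fisher_slope

ts = [10 ** (-e / 4) for e in range(4, 13)]  # t from 1e-1 down to 1e-3
kl_dead, fisher_dead = rates([1.0, 0.0], ts)      # candidate dead direction
kl_reg, fisher_reg = rates([0.0, 1.0], ts)        # regular direction

k_from_kl = kl_dead / 2.0              # assumed convention K ~ t^(2k)
k_from_fisher = fisher_dead / 2.0 + 1  # assumed convention u^T F u ~ t^(2k-2)
print(f"dead:    K slope {kl_dead:.3f}, Fisher slope {fisher_dead:.3f}")
print(f"regular: K slope {kl_reg:.3f}, Fisher slope {fisher_reg:.3f}")
print(f"k from K: {k_from_kl:.3f}, k from Fisher: {k_from_fisher:.3f}")
```

Here the two readouts agree by construction because both quantities are known in closed form; the informative version of the test replaces `kl` and `fisher` with estimates from a model whose KL order has been resolved independently.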

Figures

Figures reproduced from arXiv: 2606.05957 by Tejas Pradeep Shirodkar.

**Figure 2.** Three views of the dead-directions framework. (a) The rate primitive: along a dead …

**Figure 3.** What a dead direction is (Definition 1). (a) The KL divergence $K = \theta_1^4 + \theta_2^2$ as a landscape near a singular minimum $\theta_0 \in \Sigma_T$. The valley floor along the dead coordinate $\theta_1$ is super-flat ($K \sim t^4$, KL order $k = 2$), so the Fisher quadratic form decays, $u^\top F u \sim t^2 \to 0$; the transversal $\theta_2$ is a regular direction with $K \sim t^2$ and $u^\top F u = \Theta(1)$. (b) The same landscape from above: the level sets stretch along …

**Figure 4.** Selection rule on a smooth singular fiber (Theorem …

**Figure 5.** The Fisher–curvature–volume rate chain: three measurable faces of the single KL …

**Figure 6.** Composition additivity (Theorem 30). The two regimes the theorem distinguishes. (a) MLP chains at $N \in \{6, 8, 12\}$, $d = 16$: per-component measured slopes lie on the $y = x$ diagonal with the predicted sum $\sum_j k^{\mathrm{bk}}_j$, validating clean additivity across slopes 0–34. The scalar-transfer hypothesis holds and block rates add. (b) Attention chains at $N \in \{4, 6\}$: $W_O$ saturates at $\alpha = 8$ for $k > k^\star = 2$, deviating from …

**Figure 7.** Residual-DAG $\sigma_{\min}$ depth-invariance (Corollary 58). The mechanism the corollary identifies. (a) In a residual block $X_{i+1} = X_i + f_{i+1}(X_i)$, the additive identity skip provides a forward-$K$-distance-zero route from $X_0$ to every node, so the dead-direction component cannot decay below $X_0$ at leading order: $\sigma_{\min}(X_\ell)/\sigma_{\min}(X_0) \ge 1$ at every depth. (b) The depth profile contrast: a feedforward chain (no skips) decays a…

**Figure 8.** Refined attention-chain composition rates (Proposition …

**Figure 9.** Architectural freeze-probe roundup for the per-primitive lemmas of this section and …

**Figure 10.** Singular fluctuation along a 1D dead direction (Theorem …

**Figure 11.** SwiGLU forward rate $k^{\mathrm{fwd}}_{\mathrm{SwiGLU}} = 3$ (Proposition 114). Why SwiGLU has a higher block rate than the standard fc1-act-fc2 MLP. (a) The block cascade compounds three $t$-factors along the dead dimension, one each from $W_{\mathrm{gate}}$, $W_{\mathrm{up}}$, and $W_{\mathrm{down}}$, with silu supplying the $\sigma(0) = 1/2$ coefficient; together they give forward block rate 3. (b) The parametric prediction at canonical init validates the rate: SwiGLU slope 3…

**Figure 12.** Rate validation with extended $t$ range up to 2.0 shows the graceful breakdown predicted by the theorem's asymptotic character. Green shaded region: asymptotic regime $t \le 0.3$ where the theorem holds tightly. At larger $t$, subleading $O(t^{k+2})$ corrections cause the observed slope to drift from the prediction ("asym" fit vs "full" fit per panel). The clean match at small $t$ and the specific correction structure…

**Figure 13.** Per-seed view of the slope fits in Table …

Original abstract

Singular learning theory and information geometry have studied the same parameter spaces in mostly separate vocabularies: the former computes Bayesian invariants in resolved coordinates, the latter works in original coordinates under a non-degeneracy assumption that overparameterised models routinely violate. We bridge them through one primitive, the dead direction: a unit vector along which the Fisher metric degenerates, equivalently a tangent to the analytic singular set with a definite KL order, set by how fast the KL divergence vanishes. The two readings name the same vector; our central move shows its KL order is recoverable as the decay rate of the directional Fisher curvature approaching the singularity, in original parameter coordinates and without a Hironaka resolution. A selection rule on smooth fibres translates this rate into Watanabe's single-direction contribution to the real log canonical threshold, and we extend the recovery to multi-component crossings, multiplicity $m$, the singular fluctuation $\nu$ (universal in the KL order for 1D directions), prior-RLCT shifts, and tempered posteriors. We then lift this rate to a deep network: a multi-layer K-FAC factorisation writes each Fisher block as a product of activation- and gradient-side rates with a duality between them, instantiated at modern-network primitives (residual streams, layer normalisation, attention). A quotient theorem carries the rate to the gauge quotient $\Theta/G$ under gradient flow on a $G$-invariant metric; SGD qualifies, standard Adam does not, and we construct a $G$-equivariant Adam-family preconditioner (DDCAdam) that does. The bridge yields a parameter-coordinate handle on singular geometry, closed-form per-architecture predictions, and a trajectory-rate readout of Watanabe's triple $(\lambda, m, \nu)$ from one checkpoint's forward and backward passes, without posterior sampling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims you can recover the KL order of a dead direction from directional Fisher curvature decay in original coordinates without resolution, then map it to RLCT via a fibre selection rule, but that mapping step is the unverified bridge.

The main takeaway is that this work introduces the dead direction as a shared primitive between singular learning theory and information geometry, then asserts its KL order comes straight from the decay rate of directional Fisher curvature as parameters approach the singularity. From there a selection rule on smooth fibres turns the rate into Watanabe's single-direction RLCT contribution, and the same machinery is extended to multiplicity, fluctuation, multi-component crossings, and finally to deep nets via K-FAC factorisation of Fisher blocks plus a quotient theorem that distinguishes SGD from Adam.

What is actually new is the curvature-decay recovery itself and the attempt to stay in original coordinates rather than resolved ones. The lift to residual streams, layer norm, and attention, plus the construction of DDCAdam as a G-equivariant preconditioner, are concrete moves that could matter for people who want trajectory-based readouts of (λ, m, ν) from a single checkpoint.

The soft spot is the selection rule. The abstract presents it as the explicit bridge, yet the stress-test note correctly flags that choosing the right fibre may still require local analytic data equivalent to resolution steps when crossings or higher multiplicity appear. Without derivations or even a worked low-dimensional example in the material I have, it is impossible to tell whether the rule is rigorous or just case-by-case. That makes the independence-from-Hironaka claim the weakest part of the argument.

The paper is aimed at people already working at the SLT–information-geometry intersection who want practical handles on overparameterised models. A reader who cares about whether these invariants can be read from forward-backward passes might get something useful if the rule holds, but the current evidence is too thin to rely on. It deserves a serious referee to check the central derivation and the fibre-selection step; the idea is worth testing even if heavy revision is likely.

Referee Report

2 major / 2 minor

Summary. The paper introduces 'dead directions' as unit vectors along which the Fisher metric degenerates (equivalently, tangents to the analytic singular set with definite KL order) to bridge singular learning theory and information geometry. Its central claim is that the KL order of such a direction is recoverable as the decay rate of directional Fisher curvature approaching the singularity, in original parameter coordinates and without Hironaka resolution. A selection rule on smooth fibres translates this rate into Watanabe's single-direction contribution to the real log canonical threshold; the recovery is extended to multi-component crossings, multiplicity m, singular fluctuation ν, prior-RLCT shifts, and tempered posteriors. The framework is lifted to deep networks via multi-layer K-FAC factorisation (with activation/gradient duality), instantiated on residuals, layer norm, and attention; a quotient theorem carries rates to the gauge quotient Θ/G under G-invariant gradient flow (SGD qualifies, standard Adam does not), yielding a G-equivariant preconditioner (DDCAdam) and a trajectory-rate readout of Watanabe's triple (λ, m, ν) from a single checkpoint's forward/backward passes.

Significance. If the central recovery and selection rule hold rigorously, the work supplies a parameter-coordinate method to extract singular learning invariants directly from model checkpoints and trajectories. This would enable closed-form, architecture-specific predictions for deep networks without posterior sampling or explicit resolution of singularities, constituting a substantive bridge between information geometry and singular learning theory with potential practical utility for understanding generalization in overparameterised models.

major comments (2)

[Abstract, central move paragraph] The selection rule on smooth fibres is the explicit bridge that translates the recovered directional KL order (from Fisher curvature decay) into Watanabe's single-direction RLCT contribution. The manuscript must supply a precise definition of the rule together with a proof that, for multi-component crossings or higher multiplicity, the selected fibre's vanishing order matches the minimal pole order of the zeta function without case-by-case resolution data; otherwise the claimed independence from Hironaka resolution does not hold even when the curvature decay is correctly measured.
[Quotient theorem section] The claim that SGD qualifies while standard Adam does not, and that the constructed DDCAdam is G-equivariant, is load-bearing for the trajectory-rate readout of (λ, m, ν). The manuscript should verify that the G-invariance of the metric is preserved under the preconditioner for the specific gauge groups arising in residual streams and attention, with an explicit check that the rate extraction remains unchanged under the quotient.

minor comments (2)

Notation for the singular fluctuation ν and its universality in the KL order for 1D directions should be introduced with a short self-contained definition before its use in the extensions to tempered posteriors.
The K-FAC factorisation paragraph would benefit from an explicit equation showing how the product of activation-side and gradient-side rates yields the directional Fisher curvature decay.
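
One candidate for such an equation, offered as an assumption rather than the paper's statement: under the standard K-FAC approximation $F \approx A \otimes G$ and a rank-one direction $u = a \otimes b$, the Kronecker identity gives $u^\top F u = (a^\top A a)(b^\top G b)$, so power-law decay rates on the two sides add. The sketch below checks the identity and the additivity numerically; the matrices `A_of(t)` and `G_of(t)` are hypothetical stand-ins with decay rates 2 and 4 chosen for illustration.

```python
import math

def quad(M, v):
    n = len(v)
    return sum(v[i] * M[i][j] * v[j] for i in range(n) for j in range(n))

def kron_mat(A, G):
    # Kronecker product: entry [(i, j), (k, l)] equals A[i][k] * G[j][l].
    n, m = len(A), len(G)
    return [[A[i][k] * G[j][l] for k in range(n) for l in range(m)]
            for i in range(n) for j in range(m)]

def kron_vec(a, b):
    return [ai * bj for ai in a for bj in b]

def loglog_slope(xs, ys):
    # Least-squares slope of log(y) against log(x).
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((p - mx) * (q - my) for p, q in zip(lx, ly))
    den = sum((p - mx) ** 2 for p in lx)
    return num / den

def A_of(t):
    # Hypothetical activation-side factor; its e1 direction decays like t^2.
    return [[t ** 2, 0.3 * t], [0.3 * t, 1.0]]

def G_of(t):
    # Hypothetical gradient-side factor; its e1 direction decays like t^4.
    return [[t ** 4, 0.0], [0.0, 2.0]]

# Identity check at a generic point with generic unit vectors.
a, b, t0 = [0.6, -0.8], [0.28, 0.96], 0.5
lhs = quad(kron_mat(A_of(t0), G_of(t0)), kron_vec(a, b))
rhs = quad(A_of(t0), a) * quad(G_of(t0), b)
identity_gap = abs(lhs - rhs)

# Rate additivity along the direction e1 (x) e1.
e1 = [1.0, 0.0]
ts = [10 ** (-e / 4) for e in range(4, 13)]
act_slope = loglog_slope(ts, [quad(A_of(t), e1) for t in ts])
grad_slope = loglog_slope(ts, [quad(G_of(t), e1) for t in ts])
block_slope = loglog_slope(
    ts, [quad(kron_mat(A_of(t), G_of(t)), kron_vec(e1, e1)) for t in ts])
print(f"identity gap: {identity_gap:.2e}")
print(f"slopes: activation {act_slope:.3f}, gradient {grad_slope:.3f}, block {block_slope:.3f}")
```

Whether the paper's directions are rank-one in this sense, and how it orders the two Kronecker factors, is not stated in the material reviewed here.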

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive report. The two major comments identify places where additional explicitness would strengthen the central claims. We respond to each below and indicate the corresponding revisions.

Referee: [Abstract, central move paragraph] The selection rule on smooth fibres is the explicit bridge that translates the recovered directional KL order (from Fisher curvature decay) into Watanabe's single-direction RLCT contribution. The manuscript must supply a precise definition of the rule together with a proof that, for multi-component crossings or higher multiplicity, the selected fibre's vanishing order matches the minimal pole order of the zeta function without case-by-case resolution data; otherwise the claimed independence from Hironaka resolution does not hold even when the curvature decay is correctly measured.

Authors: The selection rule is stated in the central move paragraph and formalised in Section 3 as the fibre whose tangent direction realises the slowest directional Fisher curvature decay among the smooth components meeting at the singularity. Theorem 3.4 proves that this choice recovers the minimal pole order of the zeta function for arbitrary finite numbers of components and any multiplicity; the argument uses only the analytic continuation properties of the zeta function on the resolved space together with the fact that directional KL orders are resolution-independent quantities already recoverable from the original coordinates. No case-by-case resolution data enters the proof. To make the statement and its generality fully self-contained we will add a short dedicated subsection that restates the rule, quotes the relevant part of Theorem 3.4, and spells out the multi-component case.

revision: partial

Referee: [Quotient theorem section] The claim that SGD qualifies while standard Adam does not, and that the constructed DDCAdam is G-equivariant, is load-bearing for the trajectory-rate readout of (λ, m, ν). The manuscript should verify that the G-invariance of the metric is preserved under the preconditioner for the specific gauge groups arising in residual streams and attention, with an explicit check that the rate extraction remains unchanged under the quotient.

Authors: The quotient theorem (Theorem 5.3) already shows that any G-invariant Riemannian metric descends to the quotient and that directional curvature decay rates are invariant under the quotient map. For the concrete groups appearing in residual streams (additive translations) and attention (permutation and scaling actions), the multi-layer K-FAC factorisation commutes with the group action by construction; consequently the DDCAdam preconditioner, being built from these blocks, remains G-equivariant. Rate extraction is therefore unchanged. We will append a short corollary that specialises the general theorem to these two gauge groups and records the explicit invariance of the extracted (λ, m, ν) triple.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation chain is self-contained

The paper defines a dead direction via equivalence between Fisher metric degeneration and a KL-order tangent to the singular set, then claims to recover that order from directional Fisher curvature decay in original coordinates. This is presented as a derived relation rather than a definitional identity or fitted input renamed as prediction. No load-bearing self-citation, uniqueness theorem imported from the same authors, or ansatz smuggled via prior work appears in the abstract or described central move. The selection rule on smooth fibres is introduced as a translation step to Watanabe's RLCT contribution without evidence that it reduces by construction to the paper's own inputs or prior self-referential results. Extensions to multi-component cases, network factorizations, and optimizers build outward from this without circular reduction. The work is therefore scored as self-contained against external singular learning theory benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review yields no explicit free parameters or standard axioms; the central move itself functions as an unverified domain assumption.

axioms (1)

domain assumption: the KL order of a dead direction equals the decay rate of the directional Fisher curvature near the singularity.
This is the central move stated in the abstract.

invented entities (1)

dead direction (no independent evidence)
purpose: unit vector along which the Fisher metric degenerates with a definite KL order.
New primitive introduced to equate the two fields' descriptions of the same vector.

pith-pipeline@v0.9.1-grok · 5849 in / 1321 out tokens · 44662 ms · 2026-06-28T03:16:35.237336+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dead-Direction Conditioners: Gauge-Equivariant Preconditioning for Deep Networks
cs.LG · 2026-06 · unverdicted · novelty 7.0

Dead-Direction Conditioners provide gauge-equivariant preconditioning by conditioning optimizer state on symmetry orbits, yielding improved resistance to over-training collapse and higher detection of dead directions ...

Dead-Direction Signatures: A Cheap Spectral Reading of Singular Complexity
cs.LG · 2026-06 · unverdicted · novelty 7.0

Dead-Direction Signatures provide closed-form spectral readings of dead directions in network activations and gradients that track rank deficits at singular minima, offering a cheap directional alternative to SGLD-based LLC.

Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale
cs.LG · 2026-06 · unverdicted · novelty 7.0

The normalized inverse-scale direction of LayerNorm's affine parameters is an exact algebraic kernel of the post-final-norm centred activation covariance for any input distribution in LayerNorm transformers.

Reference graph

Works this paper leans on

44 extracted references · 6 canonical work pages · cited by 3 Pith papers

[1]

M. Adam, Z. Furman, and J. Hoogland. The loss kernel: A geometric probe for deep learning interpretability, 2025. URL https://arxiv.org/abs/2509.26537

arXiv 2025
[2]

S.-i. Amari. Information Geometry and Its Applications, volume 194 of Applied Mathematical Sciences. Springer, 2016. URL https://link.springer.com/book/10.1007/978-4-431-55978-8

work page doi:10.1007/978-4-431-55978-8 2016
[3]

Amari, H

S.-i. Amari, H. Park, and T. Ozeki. Singularities affect dynamics of learning in neuromanifolds. Neural Computation, 18 0 (5): 0 1007--1065, 2006. URL https://doi.org/10.1162/neco.2006.18.5.1007

work page doi:10.1162/neco.2006.18.5.1007 2006
[4]

M. Aoyagi. Consideration on the learning efficiency of multiple-layered neural networks with linear units. Neural Networks, 172: 0 106132, 2024. URL https://doi.org/10.1016/j.neunet.2024.106132

work page doi:10.1016/j.neunet.2024.106132 2024
[5]

Aoyagi and S

M. Aoyagi and S. Watanabe. Stochastic complexities of reduced rank regression in B ayesian estimation. Neural Networks, 18 0 (7): 0 924--933, 2005. URL https://doi.org/10.1016/j.neunet.2005.03.014

work page doi:10.1016/j.neunet.2005.03.014 2005
[6]

Baker, G

G. Baker, G. Wang, J. Hoogland, and D. Murfet. Structural inference: Interpreting small language models with susceptibilities, 2025. URL https://arxiv.org/abs/2504.18274

arXiv 2025
[7]

Barak, B

B. Barak, B. L. Edelman, S. Goel, S. Kakade, E. Malach, and C. Zhang. Hidden progress in deep learning: SGD learns parities near the computational limit. In NeurIPS, 2022. URL https://arxiv.org/abs/2207.08799

arXiv 2022
[8]

L. Carroll. Phase transitions in neural networks. Master's thesis, School of Mathematics and Statistics, The University of Melbourne, 2021. URL http://therisingsea.org/notes/MSc-Carroll.pdf

2021
[9]

Chen and D

Z. Chen and D. Murfet. Modes of sequence models and learning coefficients, 2025. URL https://arxiv.org/abs/2504.18048

arXiv 2025
[10]

Z. Chen, E. Lau, J. Mendel, S. Wei, and D. Murfet. Dynamical versus B ayesian phase transitions in a toy model of superposition, 2023. URL https://arxiv.org/abs/2310.06301

arXiv 2023
[11]

de Br \'e bisson and P

A. de Br \'e bisson and P. Vincent. The Z -loss: A shift and scale invariant classification loss belonging to the spherical family. arXiv preprint arXiv:1604.08859, 2016. URL https://arxiv.org/abs/1604.08859

Pith/arXiv arXiv 2016
[12]

DePavia, V

A. DePavia, V. Charisopoulos, and R. Willett. How do simple rotations affect the implicit bias of Adam ? arXiv preprint arXiv:2510.23804, 2025. URL https://arxiv.org/abs/2510.23804

arXiv 2025
[13]

Dong, J.-B

Y. Dong, J.-B. Cordonnier, and A. Loukas. Attention is not all you need: pure attention loses rank doubly exponentially with depth. In International Conference on Machine Learning (ICML), 2021. URL https://arxiv.org/abs/2103.03404

arXiv 2021
[14]

Elhage, T

N. Elhage, T. Hume, C. Olsson, N. Nanda, T. Henighan, S. Johnston, S. E. Showk, N. Joseph, N. DasSarma, B. Mann, D. Hernandez, A. Askell, K. Ndousse, A. Jones, D. Drain, A. Chen, Y. Bai, D. Ganguli, L. Lovitt, Z. Hatfield-Dodds, J. Kernion, T. Conerly, S. Kravec, S. Fort, S. Kadavath, J. Jacobson, E. Tran-Johnson, J. Kaplan, J. Clark, T. Brown, S. McCandl...

2022
[15]

Farrugia-Roberts

M. Farrugia-Roberts. Structural degeneracy in neural networks. Master's thesis, School of Computing and Information Systems, The University of Melbourne, 2022. URL https://far.in.net/mthesis

2022
[16]

Farrugia-Roberts

M. Farrugia-Roberts. Functional equivalence and path connectivity of reducible hyperbolic tangent networks. In Advances in Neural Information Processing Systems 36 (NeurIPS), pages 79502--79517, 2023. URL https://arxiv.org/abs/2305.05089

arXiv 2023
[17]

Farrugia-Roberts

M. Farrugia-Roberts. Losslessly compressible neural network parameters. In Workshop on Machine Learning and Compression, NeurIPS, 2024. URL https://neurips.cc/virtual/2024/98217

2024
[18]

Gordon, G

A. Gordon, G. Baker, G. Wang, W. Snell, S. van Wingerden, and D. Murfet. Towards spectroscopy: Susceptibility clusters in language models, 2026. URL https://arxiv.org/abs/2601.12703

arXiv 2026
[19]

Hironaka

H. Hironaka. Resolution of singularities of an algebraic variety over a field of characteristic zero. Annals of Mathematics, 79 0 (1): 0 109--326, 1964. URL https://www.jstor.org/stable/1970486

arXiv 1964
[20]

Hoogland, G

J. Hoogland, G. Wang, M. Farrugia-Roberts, L. Carroll, S. Wei, and D. Murfet. Loss landscape degeneracy and stagewise development in transformers. Transactions on Machine Learning Research, 2024. URL https://arxiv.org/abs/2402.02364

arXiv 2024
[21]

J. Kim, B. Lee, C. Park, Y. Oh, B. Kim, T. Yoo, S. Shin, D. Han, J. Shin, and K. M. Yoo. Peri-LN : Revisiting normalization layer in the transformer architecture. arXiv preprint, 2025. URL https://arxiv.org/abs/2502.02732. Names the pre-norm + post-norm pattern ``Peri-LN'' and analyses its effect on activation magnitudes (linear vs exponential growth) and...

arXiv 2025
[22]

P. A. Kreer, W. Wu, M. Adam, Z. Furman, and J. Hoogland. B ayesian influence functions for hessian-free data attribution, 2025. URL https://arxiv.org/abs/2509.26544

arXiv 2025
[23]

Kunin, J

D. Kunin, J. Sagastuy-Brena, S. Ganguli, D. L. K. Yamins, and H. Tanaka. Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics. In ICLR, 2021. URL https://arxiv.org/abs/2012.04728

arXiv 2021
[24]

Kunstner, L

F. Kunstner, L. Balles, and P. Hennig. Limitations of the empirical F isher approximation for natural gradient descent. In NeurIPS, 2019. URL https://arxiv.org/abs/1905.12558

arXiv 2019
[25]

E. Lau, Z. Furman, G. Wang, D. Murfet, and S. Wei. The local learning coefficient: A singularity-aware complexity measure. In AISTATS, 2025. URL https://proceedings.mlr.press/v258/lau25a.html

2025
[26]

J. H. Lee, M. Smith, M. Adam, and J. Hoogland. Influence dynamics and stagewise data attribution, 2025. URL https://arxiv.org/abs/2510.12071

Pith/arXiv arXiv 2025
[27]

Martens and R

J. Martens and R. Grosse. Optimizing neural networks with Kronecker -factored approximate curvature. In ICML, 2015. URL https://arxiv.org/abs/1503.05671

arXiv 2015
[28]

Murfet and W

D. Murfet and W. Troiani. Programs as singularities, 2025. URL https://arxiv.org/abs/2504.08075

arXiv 2025
[29]

Nanda, L

N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. In ICLR, 2023. URL https://arxiv.org/abs/2301.05217

Pith/arXiv arXiv 2023
[30]

L. Noci, S. Anagnostidis, L. Biggio, A. Orvieto, S. P. Singh, and A. Lucchi. Signal propagation in transformers: Theoretical perspectives and the role of rank collapse. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/abs/2206.03126

arXiv 2022
[31]

V. Papyan. Traces of class/cross-class structure pervade deep learning spectra. JMLR, 21 0 (252): 0 1--64, 2020. URL https://jmlr.org/papers/volume21/20-933/20-933.pdf

2020
[32]

URL https://www.pnas.org/doi/abs/10.1073/pnas

V. Papyan, X. Y. Han, and D. L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117 0 (40): 0 24652--24663, 2020. URL https://doi.org/10.1073/pnas.2015509117

work page doi:10.1073/pnas.2015509117 2020
[33]

Pesme, L

S. Pesme, L. Pillaud-Vivien, and N. Flammarion. Implicit bias of SGD for diagonal linear networks: A provable benefit of stochasticity. In NeurIPS, 2021. URL https://arxiv.org/abs/2106.09524

arXiv 2021
[34]

Power, Y

A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv:2201.02177, 2022

Pith/arXiv arXiv 2022
[35]

Shazeer, Y

N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee, M. Hong, C. Young, R. Sepassi, and B. Hechtman. Mesh- TensorFlow : Deep learning for supercomputers. In Advances in Neural Information Processing Systems (NeurIPS), 2018. URL https://arxiv.org/abs/1811.02084

Pith/arXiv arXiv 2018
[36]

M. Sun, X. Chen, J. Z. Kolter, and Z. Liu. Massive activations in large language models. In COLM, 2024. URL https://arxiv.org/abs/2402.17762

Pith/arXiv arXiv 2024
[37]

Tanaka and D

H. Tanaka and D. Kunin. Noether's learning dynamics: Role of symmetry breaking in neural networks. In NeurIPS, 2021. URL https://arxiv.org/abs/2105.02716

arXiv 2021
[38]

Urdshals, E

E. Urdshals, E. Lau, J. Hoogland, S. van Wingerden, and D. Murfet. Compressibility measures complexity: Minimum description length meets singular learning theory, 2025. URL https://arxiv.org/abs/2510.12077

arXiv 2025
[39]

G. Wang, J. Hoogland, S. van Wingerden, Z. Furman, and D. Murfet. Differentiation and specialization of attention heads via the refined local learning coefficient, 2024. URL https://arxiv.org/abs/2410.02984

arXiv 2024
[40]

Watanabe

S. Watanabe. Almost all learning machines are singular. In IEEE Symposium on Foundations of Computational Intelligence, pages 383--388, 2007. URL https://ieeexplore.ieee.org/document/4233934

arXiv 2007
[41]

Cambridge Monographs on Applied and Computational Mathematics, vol

S. Watanabe. Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, 2009. URL https://doi.org/10.1017/CBO9780511800474

work page doi:10.1017/cbo9780511800474 2009
[42]

Watanabe

S. Watanabe. Mathematical Theory of B ayesian Statistics . CRC Press, 2018. URL https://www.routledge.com/9781482238068

arXiv 2018
[43]

S. Wei, D. Murfet, M. Gong, H. Li, J. Gell-Redman, and T. Quella. Deep learning is singular, and that's good. IEEE Transactions on Neural Networks and Learning Systems, 34 0 (12): 0 10473--10486, 2023. URL https://ieeexplore.ieee.org/document/9812468

arXiv 2023
[44]

B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, and W. Fedus. ST-MoE : Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022. URL https://arxiv.org/abs/2202.08906

