
arxiv: 2606.29176 · v1 · pith:6OTCPCX3new · submitted 2026-06-28 · 💻 cs.LG · math.DG · math.OC · stat.ML

Dead-Direction Conditioners: Gauge-Equivariant Preconditioning for Deep Networks

Pith reviewed 2026-06-30 07:56 UTC · model grok-4.3

classification 💻 cs.LG · math.DG · math.OC · stat.ML
keywords gauge equivariance · dead direction conditioner · optimizer preconditioning · neural network symmetries · quotient manifold · Adam optimizer · Muon optimizer · over-training collapse

The pith

Making optimizers gauge-equivariant by conditioning on symmetry orbits keeps trajectories on the loss quotient and changes the minimum reached.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a deep network loss stays unchanged under continuous parameter symmetries such as logit shifts, ReLU rescalings, LayerNorm scales and per-head attention rotations. Standard preconditioners drift along those orbits and move the trajectory away from the quotient manifold on which the effective optimization occurs. DDC corrects this by lifting a base optimizer into a G-equivariant form that preconditions its state using the orbit decomposition of a G-invariant metric. The result is a trajectory that remains a preconditioned gradient flow on the quotient, which in turn alters both the minimum found and the geometric quantities that can be read from it. Experiments on language models, vision transformers and grokking tasks show the practical effects of this change.
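Two of the symmetries named above can be checked numerically in a few lines. The following is a minimal sketch, not code from the paper: the helper names `cross_entropy` and `two_layer_relu`, the layer sizes and the random seed are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed stably."""
    z = logits - logits.max()
    return -(z[label] - np.log(np.exp(z).sum()))

def two_layer_relu(x, w_in, w_out):
    """f(x) = w_out @ relu(w_in @ x), with no biases (illustrative toy network)."""
    return w_out @ np.maximum(w_in @ x, 0.0)

# Logit shift: adding the same constant to every logit leaves the loss unchanged.
logits = rng.normal(size=5)
ce_ref = cross_entropy(logits, label=2)
ce_shifted = cross_entropy(logits + 3.7, label=2)

# ReLU rescaling: scale each hidden unit's incoming weights by alpha > 0
# and its outgoing weights by 1 / alpha; the network function is unchanged.
x = rng.normal(size=4)
w_in = rng.normal(size=(6, 4))
w_out = rng.normal(size=(3, 6))
alpha = rng.uniform(0.5, 2.0, size=6)
out_ref = two_layer_relu(x, w_in, w_out)
out_scaled = two_layer_relu(x, alpha[:, None] * w_in, w_out / alpha[None, :])

# Both differences should sit at floating-point round-off.
print(abs(ce_ref - ce_shifted), np.abs(out_ref - out_scaled).max())
```

Because the function value is identical along each orbit, any parameter motion along the orbit is invisible to the loss, which is the sense in which these directions are "dead".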

Core claim

A deep network's loss is invariant to continuous symmetries of its parameters: the logit shift, the ReLU rescaling, the LayerNorm scale, the per-head attention rotation. Adam's per-coordinate preconditioner drifts along each symmetry orbit, which pulls the trajectory off the symmetry quotient where the optimization lives and blurs the singular-learning rate the quotient makes readable. We build DDC, a Dead-Direction Conditioner that lifts a base optimizer into a G-equivariant one by conditioning the optimizer's state in the orbit decomposition of a G-invariant metric, so the trajectory stays a preconditioned gradient flow on the quotient. The construction carries four architectural gauges, p

What carries the argument

The Dead-Direction Conditioner, which preconditions optimizer state in the orbit decomposition of a G-invariant metric to enforce G-equivariance and keep the trajectory on the parameter quotient.
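To make the idea of keeping an update horizontal concrete, here is a hedged toy sketch for the logit-shift gauge only, acting on a single output bias vector. It is not the paper's Equation (1): it uses the Euclidean metric, drops the vertical component instead of treating it separately, and every name (`project_horizontal`, `adam_direction`, `class_probs`) is a hypothetical choice for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adam_direction(m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Per-coordinate Adam direction with the usual bias correction."""
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return m_hat / (np.sqrt(v_hat) + eps)

def project_horizontal(u):
    """Remove the component along the all-ones orbit direction (Euclidean metric)."""
    return u - u.mean()

rng = np.random.default_rng(0)
num_classes, lr, beta1, beta2 = 5, 0.1, 0.9, 0.999
class_probs = [0.5, 0.2, 0.15, 0.1, 0.05]   # assumed label distribution
b_plain = np.zeros(num_classes)             # output bias, plain Adam
b_proj = np.zeros(num_classes)              # output bias, re-projected update
state = {name: (np.zeros(num_classes), np.zeros(num_classes))
         for name in ("plain", "proj")}

for step in range(1, 201):
    label = rng.choice(num_classes, p=class_probs)
    for name, b in (("plain", b_plain), ("proj", b_proj)):
        # Cross-entropy gradient w.r.t. the bias; it sums to zero, so it is horizontal.
        grad = softmax(b) - np.eye(num_classes)[label]
        m, v = state[name]
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        state[name] = (m, v)
        update = adam_direction(m, v, step)
        if name == "proj":
            update = project_horizontal(update)
        b -= lr * update   # in-place, so b_plain / b_proj are modified

drift_plain = abs(b_plain.mean())   # displacement along the shift orbit
drift_proj = abs(b_proj.mean())
print(drift_plain, drift_proj)
```

The raw gradient carries no component along the shift orbit, but Adam's per-coordinate rescaling reintroduces one; the re-projection removes it again, so the projected run accumulates no displacement along the orbit beyond round-off.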

If this is right

  • On language models trained past the point of fit, DDCAdam holds a validation-train loss gap of 0.67 against 5.88 for AdamW.
  • DDCAdam reads the dead-direction rate in 32 of 65 layer-by-observable cells where AdamW reads it in 7.
  • A vision transformer trained from scratch with DDCAdam reaches validation loss 1.71 against 2.12 for AdamW while compressing spare feed-forward capacity.
  • On a Muon base, DDCMuon groks ten of eleven seeds at depth 24 where plain Muon reaches none.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the quotient geometry is preserved, quantities from singular learning theory such as the learning coefficient may become directly estimable from the trained network without additional post-processing.
  • The same orbit-decomposition approach could be applied to other known symmetries in convolutional or recurrent architectures to test whether similar stabilization occurs.
  • Optimizer design that explicitly accounts for the symmetry group of the loss may reduce the need for manual hyper-parameter tuning that currently compensates for orbit drift.

Load-bearing premise

The four listed architectural gauges capture the relevant continuous symmetries and the orbit decomposition preconditions without introducing new artifacts or changing the effective loss landscape.

What would settle it

A direct computation of the distance between the DDC trajectory and the symmetry quotient manifold, or a controlled run in which enforcing the gauges leaves the validation-train loss gap and dead-direction detection rates unchanged.
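One possible proxy for such a computation, sketched here under stated assumptions, is the per-step component of an update along the symmetry orbit. The example uses the ReLU-rescaling gauge on a two-layer network with the Euclidean metric; `loss_and_grads` and `vertical_coefficients` are hypothetical helpers, and the paper may define its diagnostic differently.

```python
import numpy as np

def loss_and_grads(x, y, w_in, w_out):
    """MSE loss and gradients for f(x) = w_out @ relu(w_in @ x)."""
    pre = w_in @ x
    hidden = np.maximum(pre, 0.0)
    err = w_out @ hidden - y
    loss = 0.5 * np.sum(err**2)
    g_out = np.outer(err, hidden)
    g_hidden = (w_out.T @ err) * (pre > 0)
    g_in = np.outer(g_hidden, x)
    return loss, g_in, g_out

def vertical_coefficients(w_in, w_out, d_in, d_out):
    """Per-hidden-unit coefficient of an update along the rescaling orbit.

    The orbit tangent for unit i is (w_in[i], -w_out[:, i]); the coefficient is
    the Euclidean projection of the update (d_in, d_out) onto that tangent.
    """
    numer = np.sum(w_in * d_in, axis=1) - np.sum(w_out * d_out, axis=0)
    denom = np.sum(w_in**2, axis=1) + np.sum(w_out**2, axis=0)
    return numer / denom

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=3)
w_in, w_out = rng.normal(size=(6, 4)), rng.normal(size=(3, 6))
_, g_in, g_out = loss_and_grads(x, y, w_in, w_out)

# The raw gradient has no orbit component (up to round-off).
vert_grad = vertical_coefficients(w_in, w_out, g_in, g_out)
# A sign-based step (what bias-corrected Adam takes on its first step) usually does.
vert_sign = vertical_coefficients(w_in, w_out, np.sign(g_in), np.sign(g_out))
print(np.abs(vert_grad).max(), np.abs(vert_sign).max())
```

Summing the absolute vertical coefficients over training steps would give a rough measure of how far an optimizer has travelled along the orbit rather than across it.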

Figures

Figures reproduced from arXiv: 2606.29176 by Tejas Pradeep Shirodkar.

Figure 1: The construction in one picture. Left: the loss is curved across the quotient …
Figure 2: The four architectural gauge groups. Each acts on parameter space and leaves the …
Figure 3: The DDCAdam update of Equation (1). The gradient splits along the 𝜌-orthogonal projectors into an orbit (vertical) and a horizontal component. The vertical is collapsed to one scalar per orbit dimension, normalised, and lifted back; the horizontal takes per-coordinate Adam and is re-projected to stay horizontal. The re-projection is what keeps the trajectory a gradient flow on the quotient.
Figure 4: The dead-direction observable suite over training separates the four optimisers, on …
Figure 5: Descent versus gauge on the readout, three seeds, …
Figure 6: The radial coordinate decides whether a load-bearing block survives, on sparse …
Figure 7: Weight decay or projection, on the one-block modular-addition transformer (Muon …
Figure 8: Matching the rotation gauge to the rotary attention geometry. Post-grok accuracy …
Figure 9: Reliability at depth 24, per-seed accuracy trajectories pooled across a multi-condition …
Figure 10: The minimum the gauge reaches, on the depth-8 grok bench with the true-MC …
Figure 11: Beyond grokking, a vision transformer trained from scratch on ImageNet-100 (3 …
Figure 12: Z-loss inflates the gauge mode it is meant to contain (grokking transformer, single …
Figure 13: The ReLU-rescale gauge on a two-layer teacher-student network (full-batch MSE, …
Figure 14: The LayerNorm-scale gauge on the synthetic block …
Figure 15: Gauge containment on the Pesme diagonal linear network ( …
Figure 16: Reading the rate at language-model scale, on the depth-12 over-trained model (3 …
Figure 17: The 65 layer-by-observable cells of the over-training read (depth-12 LM, seed …
Figure 18: Interior activation compression at depth 24, read at the smallest singular value of …
Figure 19: Per-seed grok trajectories for the four-arm rotation decomposition (depth-8 RoPE, …
Figure 20: The true-MC Fisher smallest eigenvalue over training on the depth-8 grok bench …
Figure 21: Dead-basis axis alignment over training at matched weight decay ( …
Figure 22: ViT compression dynamics over training (ImageNet-100 from scratch, matched …
Figure 23: The grok phenomenon on the one-block mod-113 testbed (3 seeds). Train accuracy …
Figure 24: Per-optimizer learning-rate by weight-decay tuning on the mod-113 grokking …
Figure 25: Gauge versus base optimizer across the grok grid (mod-113, relu …
Figure 26: Activation effective rank at each residual node on …
Figure 27: Keeping the load-bearing block and freezing the gauge mode are separate proper …
Figure 28: Step-by-step 𝐺-equivariance for the five gauge constructions. Each line is the relative deviation of a gauge-shifted parameter copy from the gauge image of the reference copy, recorded at every step of 50 DDCAdam steps fed 𝐺-related gradients (fp32). All five stay at machine precision and accumulate only the expected fp32 round-off, holding below the $10^{-6}$ bound throughout …

Original abstract

A deep network's loss is invariant to continuous symmetries of its parameters: the logit shift, the ReLU rescaling, the LayerNorm scale, the per-head attention rotation. Adam's per-coordinate preconditioner drifts along each symmetry orbit, which pulls the trajectory off the symmetry quotient where the optimization lives and blurs the singular-learning rate the quotient makes readable. We build DDC, a Dead-Direction Conditioner that lifts a base optimizer into a $G$-equivariant one: it conditions the optimizer's state in the orbit decomposition of a $G$-invariant metric, so the trajectory stays a preconditioned gradient flow on the quotient $\bar\Theta = \Theta/G$. The construction carries four architectural gauges (cross-entropy shift, ReLU and SwiGLU rescaling, LayerNorm and RMSNorm scale, and a per-head $O(d_{\rm head})$ attention rotation matched to RoPE), proves exactly equivariant on an Adam base, and composes with a Muon base through a gauge-equivariant orthogonaliser. Respecting the symmetry changes both the minimum the optimizer reaches and what it leaves measurable there. On a language model trained past the point of fit, DDCAdam resists the over-training collapse AdamW falls into, holding a validation-train loss gap of 0.67 against 5.88, and reads the dead-direction rate in 32 of 65 layer-by-observable cells where AdamW reads it in 7. A vision transformer trained from scratch reaches lower validation loss (1.71 against 2.12) while compressing spare feed-forward capacity a matched AdamW leaves intact. On a Muon base, where the rotation gauge composes exactly, DDCMuon groks ten of eleven seeds at depth 24 that a plain Muon never reaches. Built into the optimizer, a network's gauge symmetry sharpens the minimum it finds and turns that minimum's geometry into something the trajectory can measure.
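The abstract's vocabulary can be written out in standard quotient-manifold notation. The block below is a hedged restatement in textbook form, not the paper's own equations: the symbols $U$ (one optimizer step), $s$ (optimizer state), $V_\theta$ and $H_\theta$ are assumptions introduced here, and how $G$ acts on the state $s$ is left unspecified because the abstract does not say.

```latex
% Invariance of the loss under the gauge group G acting on parameters
L(g \cdot \theta) = L(\theta) \qquad \forall\, g \in G,\ \theta \in \Theta .

% Orbit (vertical) and horizontal subspaces with respect to a G-invariant metric \rho
T_\theta \Theta = V_\theta \oplus H_\theta, \qquad
V_\theta = T_\theta (G \cdot \theta), \qquad
H_\theta = \{\, \xi : \rho_\theta(\xi, \eta) = 0 \ \ \forall\, \eta \in V_\theta \,\}.

% Invariance makes the differential vanish on V_\theta, so the \rho-gradient is horizontal
dL_\theta(\eta) = 0 \ \ \forall\, \eta \in V_\theta
\quad\Longrightarrow\quad
\operatorname{grad}_\rho L(\theta) \in H_\theta .

% G-equivariance of an optimizer step U acting on parameters \theta and state s
U(g \cdot \theta,\, g \cdot s) = g \cdot U(\theta, s) \qquad \forall\, g \in G .
```

Read this way, the abstract's claim is that a per-coordinate preconditioner can map a horizontal gradient to an update with a nonzero $V_\theta$ component, and that DDC is built so the update satisfies the last identity.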

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Dead-Direction Conditioners (DDC), a method to lift base optimizers (Adam, Muon) into G-equivariant forms for four architectural gauge symmetries (cross-entropy shift, ReLU/SwiGLU rescaling, LayerNorm/RMSNorm scale, per-head O(d_head) rotations matched to RoPE). It preconditions optimizer state via orbit decomposition of a G-invariant metric so that trajectories remain preconditioned gradient flows on the quotient $\bar\Theta = \Theta/G$. The construction asserts exact equivariance on Adam and exact composition on Muon; empirical results show DDCAdam resisting over-training collapse (val-train gap 0.67 vs 5.88), lower ViT validation loss (1.71 vs 2.12), and DDCMuon achieving grokking on 10/11 seeds at depth 24 where plain Muon fails, while also increasing measurable dead-direction rates.

Significance. If the orbit-decomposition construction yields an exact horizontal lift without new effective potentials or altered singular learning rates, the work supplies a symmetry-respecting preconditioner that directly improves both the reached minimum and the readability of quotient geometry. The explicit gauge list, claimed exact equivariance proofs, and reproducible empirical gaps on language-model over-training and grokking constitute concrete strengths; the approach could sharpen singular-learning-rate diagnostics and symmetry-aware optimization more broadly.

major comments (3)
  1. [§3] §3 (construction) and the abstract claim of 'preconditioned gradient flow on the quotient without new effective potentials': the specific G-invariant metric and its orbit decomposition are not shown to be canonical with respect to the loss geometry. Different invariant metrics can produce different horizontal lifts, so the reported minima (validation-train gap 0.67 vs 5.88; ViT loss 1.71 vs 2.12) are consistent with either successful quotient flow or implicit landscape modification; an explicit check that the chosen metric leaves the quotient Hessian spectrum unchanged is required.
  2. [§4] The proof of exact equivariance on Adam (abstract and §4) and the Muon orthogonaliser composition: the derivation details, including how the four gauges are lifted and how the preconditioner state is updated under the orbit decomposition, are not visible in the provided text, leaving the central 'exact equivariance' claim only partially verifiable.
  3. [Empirical section] Table reporting dead-direction rates (32/65 cells for DDCAdam vs 7/65 for AdamW) and grokking counts: these are load-bearing for the claim that respecting symmetry 'turns that minimum's geometry into something the trajectory can measure'; the layer-by-observable breakdown and seed statistics must be accompanied by controls confirming that the increase is not an artifact of the metric choice altering effective learning rates on the quotient.
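The point in major comment 1 that different invariant metrics give different horizontal lifts can be seen on the simplest gauge. The sketch below assumes the logit-shift gauge, whose orbit direction is the all-ones vector, and compares two constant diagonal metrics; `horizontal_part` and the chosen weights are hypothetical and are not taken from the paper.

```python
import numpy as np

def horizontal_part(u, metric_diag):
    """Component of u that is metric-orthogonal to the all-ones orbit direction.

    metric_diag holds the diagonal of a constant metric rho = diag(metric_diag);
    any constant metric is invariant under the translation u -> u + c * ones.
    """
    ones = np.ones_like(u)
    coeff = (ones * metric_diag) @ u / ((ones * metric_diag) @ ones)
    return u - coeff * ones

u = np.array([1.0, -2.0, 0.5, 3.0])
h_euclid = horizontal_part(u, np.ones(4))
h_weighted = horizontal_part(u, np.array([4.0, 1.0, 1.0, 1.0]))
print(h_euclid, h_weighted)
```

The two lifts differ only by a multiple of the all-ones vector, so they represent the same tangent direction on the quotient. Whether the choice affects the reported minima once a per-coordinate preconditioner acts on the lifted vector is exactly what the requested check would have to establish.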
minor comments (2)
  1. [§2] Notation for the quotient $\bar\Theta = \Theta/G$ and the horizontal lift should be introduced with a short diagram or explicit coordinate chart in the first section where the orbit decomposition appears.
  2. [Abstract / §3] The four gauges are listed in the abstract; a compact table mapping each gauge to its group action and the corresponding invariant metric component would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate planned revisions.

  1. Referee: [§3] §3 (construction) and the abstract claim of 'preconditioned gradient flow on the quotient without new effective potentials': the specific G-invariant metric and its orbit decomposition are not shown to be canonical with respect to the loss geometry. Different invariant metrics can produce different horizontal lifts, so the reported minima (validation-train gap 0.67 vs 5.88; ViT loss 1.71 vs 2.12) are consistent with either successful quotient flow or implicit landscape modification; an explicit check that the chosen metric leaves the quotient Hessian spectrum unchanged is required.

    Authors: The Euclidean metric on parameter space is the canonical choice compatible with Adam's coordinate-wise structure and the architectural gauges; the orbit decomposition is constructed precisely so that the horizontal lift reproduces the base optimizer's preconditioned flow projected onto the quotient without adding effective potentials. While other invariant metrics could produce different lifts, the reported improvements are tied to this standard choice. We agree an explicit argument or numerical check confirming the quotient Hessian spectrum is unchanged would strengthen the claim and will add it to a revised §3. revision: yes

  2. Referee: [§4] The proof of exact equivariance on Adam (abstract and §4) and the Muon orthogonaliser composition: the derivation details, including how the four gauges are lifted and how the preconditioner state is updated under the orbit decomposition, are not visible in the provided text, leaving the central 'exact equivariance' claim only partially verifiable.

    Authors: The full derivations for lifting each gauge (cross-entropy shift, ReLU/SwiGLU rescaling, LayerNorm/RMSNorm scale, per-head O(d_head) rotation) and the state-update rules under orbit decomposition appear in Appendix B; the main-text §4 summarizes the resulting equivariance statements. We will expand the main-text presentation of the key lifting steps and Muon composition to make the proof self-contained without requiring the appendix. revision: yes

  3. Referee: [Empirical section] Table reporting dead-direction rates (32/65 cells for DDCAdam vs 7/65 for AdamW) and grokking counts: these are load-bearing for the claim that respecting symmetry 'turns that minimum's geometry into something the trajectory can measure'; the layer-by-observable breakdown and seed statistics must be accompanied by controls confirming that the increase is not an artifact of the metric choice altering effective learning rates on the quotient.

    Authors: We will augment the empirical section with controls that match effective learning rates on the quotient (via rescaled step-size ablations) and compare against non-equivariant runs that apply the same metric without orbit decomposition, thereby isolating the contribution of gauge-equivariant preconditioning from possible learning-rate side effects. revision: yes
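Alongside a written proof, the equivariance claim in point 2 can be sanity-checked numerically in the style of the Figure 28 caption: run a reference copy and a gauge-shifted copy on G-related gradients and track their relative deviation. The sketch below applies that protocol to plain Adam, not DDCAdam, on a two-parameter toy loss with a rescaling symmetry; the loss, the scale factor and all function names are assumptions made for illustration.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam step with bias correction; returns new (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def grad_product_loss(theta):
    """Gradient of L(a, b) = 0.5 * (a * b - 1)^2, invariant under (a, b) -> (s a, b / s)."""
    a, b = theta
    r = a * b - 1.0
    return np.array([r * b, r * a])

scale = 3.0
act = np.array([scale, 1.0 / scale])        # gauge action on parameters
act_grad = np.array([1.0 / scale, scale])   # induced action on gradients

theta_ref = np.array([0.5, 0.8])
theta_shift = act * theta_ref
m_ref = v_ref = m_shift = v_shift = np.zeros(2)
deviation = []
for t in range(1, 51):
    g = grad_product_loss(theta_ref)
    theta_ref, m_ref, v_ref = adam_step(theta_ref, g, m_ref, v_ref, t)
    theta_shift, m_shift, v_shift = adam_step(theta_shift, act_grad * g, m_shift, v_shift, t)
    image = act * theta_ref   # where an equivariant optimizer would put the shifted copy
    deviation.append(np.linalg.norm(theta_shift - image) / np.linalg.norm(image))

print(max(deviation))
```

For plain Adam the deviation is far above round-off from the first step, because the per-coordinate normalisation discards the gradient scaling that the gauge action would need to be preserved. An exactly equivariant optimizer run through the same harness should stay at machine precision.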

Circularity Check

0 steps flagged

No circularity: explicit construction with independent empirical results

The paper presents DDC as an explicit mathematical construction that lifts base optimizers (Adam, Muon) into G-equivariant forms via orbit decomposition of a G-invariant metric on four specified architectural gauges, with a claimed proof of exact equivariance. The central claims rest on this construction and on reported empirical outcomes (validation-train gaps, dead-direction detection rates, grokking seeds, ViT losses) rather than any reduction of those outcomes to quantities fitted inside the same experiment. No self-citations, self-definitional loops, fitted-input predictions, or smuggled ansatzes appear in the provided text that would make the derivation equivalent to its inputs by construction. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of the listed parameter symmetries and the technical feasibility of an orbit-based G-invariant metric for preconditioning.

axioms (2)
  • domain assumption The network loss is invariant under the four listed continuous parameter symmetries (logit shift, ReLU/SwiGLU rescaling, LayerNorm/RMSNorm scale, per-head attention rotation).
    Stated as the starting point for the construction in the abstract.
  • domain assumption A G-invariant metric exists whose orbit decomposition yields a well-defined preconditioner on the quotient.
    Required for the equivariant lift described in the abstract.

pith-pipeline@v0.9.1-grok · 5902 in / 1447 out tokens · 47988 ms · 2026-06-30T07:56:13.311303+00:00 · methodology

Reference graph

Works this paper leans on

39 extracted references · 19 canonical work pages · 9 internal anchors

  1. P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008
  2. S. Arora, N. Cohen, and E. Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In ICML, 2018. URL https://arxiv.org/abs/1802.06509
  3. S. Ashkboos, A. Mohtashami, M. L. Croci, et al. Quarot: Outlier-free 4-bit inference in rotated LLMs. In NeurIPS, 2024
  4. G. Bécigneul and O.-E. Ganea. Riemannian adaptive optimization methods. In ICLR, 2019. URL https://arxiv.org/abs/1810.00760
  5. J. Bernstein and L. Newhouse. Modular duality in deep learning, 2024. arXiv:2410.21265
  6. C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970
  7. A. de Brébisson and P. Vincent. The Z-loss: A shift and scale invariant classification loss belonging to the spherical family. arXiv preprint arXiv:1604.08859, 2016
  8. M. F. DePavia, V. Charisopoulos, and R. Willett. How do simple rotations affect the implicit bias of Adam? arXiv preprint arXiv:2510.23804, 2025
  9. O. Filatov, J. Wang, J. Ebert, and S. Kesselheim. Optimal scaling needs optimal norm, 2025. arXiv:2510.03871
  10. V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. In ICML, 2018
  11. X. Hu et al. Ostquant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting, 2025. ICLR 2025
  12. K. Jordan, Y. Jin, V. Boza, et al. Muon: An optimizer for hidden layers in neural networks, 2024. Online manuscript
  13. D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015
  14. A. Kosson, B. Messmer, and M. Jaggi. Rotational equilibrium: How weight decay balances learning across neural networks. arXiv preprint arXiv:2305.17212, 2024
  15. D. Kunin, J. Sagastuy-Brena, S. Ganguli, D. L. K. Yamins, and H. Tanaka. Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics. In ICLR, 2021
  16. E. Lau, Z. Furman, G. Wang, D. Murfet, and S. Wei. The local learning coefficient: A singularity-aware complexity measure. In AISTATS, 2025
  17. T. T.-K. Lau and W. J. Su. A symmetry-compatible principle for optimizer design: Embeddings, LM heads, SwiGLU MLPs, and MoE routers, 2026. arXiv:2605.18106
  18. Q. Li, C. Tai, and W. E. Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations. Journal of Machine Learning Research, 20(40):1–47, 2019. URL https://jmlr.org/papers/v20/17-526.html
  19. Z. Li, L. Liu, C. Liang, W. Chen, and T. Zhao. NorMuon: Making Muon more efficient and scalable, 2025. URL https://arxiv.org/abs/2510.05491
  20. S. Z. Ling, N. Sharp, and A. Jacobson. VectorAdam for rotation equivariant geometry optimization. In NeurIPS, 2022. URL https://arxiv.org/abs/2205.13599
  21. Z. Liu, C. Zhao, I. Fedorov, et al. Spinquant: LLM quantization with learned rotations. In ICLR, 2024
  22. I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In ICLR, 2019
  23. J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML, 2015
  24. N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. In ICLR, 2023
  25. S. Pesme, L. Pillaud-Vivien, and N. Flammarion. Implicit bias of SGD for diagonal linear networks: A provable benefit of stochasticity. In NeurIPS, 2021
  26. T. Pethick et al. Training deep learning models with norm-constrained LMOs, 2025. arXiv:2502.07529; the Scion optimizer
  27. L. Prieto, M. Barsbey, P. A. M. Mediano, and T. Birdal. Grokking at the edge of numerical stability. In ICLR, 2025. URL https://arxiv.org/abs/2501.04697
  28. T. P. Shirodkar. Dead directions: Geometric singular learning, 2026. URL https://arxiv.org/abs/2606.05957
  29. T. P. Shirodkar and P. J. Narayanan. Dead-direction signatures: A cheap spectral reading of singular complexity, 2026a. URL https://arxiv.org/abs/2606.21158
  30. T. P. Shirodkar and P. J. Narayanan. Algebraic dead directions in LayerNorm transformers: A forward-pass-only diagnostic at LLM scale, 2026b. URL https://arxiv.org/abs/2606.19491
  31. E. Silverstein, D. Kunin, and V. Shyam. Symmetry breaking in transformers for efficient and interpretable training, 2026. URL https://arxiv.org/abs/2601.22257
  32. H. Tanaka and D. Kunin. Noether's learning dynamics: Role of symmetry breaking in neural networks. In NeurIPS, 2021
  33. T. van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017
  34. N. Vyas, D. Morwani, R. Zhao, I. Shapira, D. Brandfonbrener, L. Janson, and S. M. Kakade. SOAP: Improving and stabilizing Shampoo using Adam. In NeurIPS, 2024
  35. R. Wan, Z. Zhu, X. Zhang, and J. Sun. Spherical motion dynamics: Learning dynamics of normalized neural network using SGD and weight decay. In NeurIPS, 2021
  36. S. Watanabe. Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, 2009
  37. S. Wei, D. Murfet, M. Gong, H. Li, J. Gell-Redman, and T. Quella. Deep learning is singular, and that's good. IEEE Transactions on Neural Networks and Learning Systems, 34(12):10473–10486, 2022
  38. J.-N. Yen, S. Si, Z. Meng, F. Yu, S. S. Duvvuri, I. S. Dhillon, C.-J. Hsieh, and S. Kumar. LoRA done RITE: Robust invariant transformation equilibration for LoRA optimization. In ICLR, 2025. arXiv:2410.20625
  39. B. Zhao, R. Walters, and R. Yu. Symmetry in neural network parameter spaces, 2025. arXiv:2506.13018