Dead-Direction Conditioners: Gauge-Equivariant Preconditioning for Deep Networks

Tejas Pradeep Shirodkar

arXiv: 2606.29176 · v1 · pith:6OTCPCX3new · submitted 2026-06-28 · 💻 cs.LG · math.DG · math.OC · stat.ML

Pith reviewed 2026-06-30 07:56 UTC · model grok-4.3

classification 💻 cs.LG · math.DG · math.OC · stat.ML

keywords gauge equivariance · dead direction conditioner · optimizer preconditioning · neural network symmetries · quotient manifold · Adam optimizer · Muon optimizer · over-training collapse

The pith

Making optimizers gauge-equivariant by conditioning on symmetry orbits keeps trajectories on the loss quotient and changes the minimum reached.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a deep network loss stays unchanged under continuous parameter symmetries such as logit shifts, ReLU rescalings, LayerNorm scales and per-head attention rotations. Standard preconditioners drift along those orbits and move the trajectory away from the quotient manifold on which the effective optimization occurs. DDC corrects this by lifting a base optimizer into a G-equivariant form that preconditions its state using the orbit decomposition of a G-invariant metric. The result is a trajectory that remains a preconditioned gradient flow on the quotient, which in turn alters both the minimum found and the geometric quantities that can be read from it. Experiments on language models, vision transformers and grokking tasks show the practical effects of this change.
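
The two claims in this paragraph (invariance of the loss along an orbit, and drift of a per-coordinate preconditioner along it) can be checked by hand for the simplest gauge, the logit shift. The snippet below is a minimal sketch and is not taken from the paper: it assumes a single softmax cross-entropy example, treats the logits themselves as the parameters, and uses the first bias-corrected Adam step, which reduces to roughly `-lr * sign(g)`. All function and variable names are illustrative.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed stably."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[label])

def grad_logits(logits, label):
    """Gradient of the cross-entropy with respect to the logits: softmax minus one-hot."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    p[label] -= 1.0
    return p

rng = np.random.default_rng(0)
logits = rng.normal(size=5)
label = 2
ones = np.ones_like(logits)  # tangent direction of the logit-shift orbit

# Logit-shift gauge: adding a constant to every logit leaves the loss unchanged.
shift_gap = abs(cross_entropy(logits, label) - cross_entropy(logits + 3.7 * ones, label))

# The raw gradient sums to zero, so it has no component along the orbit.
g = grad_logits(logits, label)
grad_along_orbit = float(g @ ones)

# First bias-corrected Adam step: m_hat = g and v_hat = g**2, so the step is
# -lr * g / (|g| + eps). Its entries no longer sum to zero.
lr, eps = 1e-3, 1e-8
adam_step = -lr * g / (np.abs(g) + eps)
adam_along_orbit = float(adam_step @ ones)

print(shift_gap, grad_along_orbit, adam_along_orbit)
```

In this toy setting the gradient is orthogonal to the orbit while the Adam step is not, which is the kind of drift the paragraph describes. How that drift behaves at scale is the paper's empirical question, not something this sketch settles.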

Core claim

What carries the argument

The Dead-Direction Conditioner, which preconditions optimizer state in the orbit decomposition of a G-invariant metric to enforce G-equivariance and keep the trajectory on the parameter quotient.
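
To make the orbit decomposition concrete, the sketch below splits a gradient into a vertical (orbit) scalar and a horizontal remainder, applies Adam-style normalisation to each, and re-projects the horizontal step. It is a hedged illustration and not the paper's Equation (1): it assumes the Euclidean metric, a single fixed orbit direction (the logit-shift direction), and a hypothetical class name `OrbitSplitAdam`.

```python
import numpy as np

def softmax_xent_grad(logits, label):
    """Gradient of softmax cross-entropy with respect to the logits."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    p[label] -= 1.0
    return p

class OrbitSplitAdam:
    """Illustrative Adam variant with a vertical/horizontal split for ONE orbit direction."""

    def __init__(self, orbit_dir, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
        self.u = orbit_dir / np.linalg.norm(orbit_dir)  # unit orbit tangent (assumed fixed)
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m = np.zeros_like(self.u)  # horizontal first moment
        self.v = np.zeros_like(self.u)  # horizontal second moment
        self.ms, self.vs, self.t = 0.0, 0.0, 0  # scalar vertical moments, step count

    def step(self, theta, grad):
        self.t += 1
        c1 = 1.0 - self.b1 ** self.t  # bias corrections
        c2 = 1.0 - self.b2 ** self.t
        s = float(grad @ self.u)      # vertical part: one scalar for the orbit
        h = grad - s * self.u         # horizontal part
        # Per-coordinate Adam on the horizontal part, then re-project onto the horizontal space.
        self.m = self.b1 * self.m + (1 - self.b1) * h
        self.v = self.b2 * self.v + (1 - self.b2) * h * h
        h_step = (self.m / c1) / (np.sqrt(self.v / c2) + self.eps)
        h_step = h_step - (h_step @ self.u) * self.u
        # Scalar Adam on the vertical part, lifted back along the orbit direction.
        self.ms = self.b1 * self.ms + (1 - self.b1) * s
        self.vs = self.b2 * self.vs + (1 - self.b2) * s * s
        s_step = (self.ms / c1) / (np.sqrt(self.vs / c2) + self.eps)
        return theta - self.lr * (h_step + s_step * self.u)

rng = np.random.default_rng(0)
theta0 = rng.normal(size=5)
theta = theta0.copy()
opt = OrbitSplitAdam(np.ones(5))
for _ in range(50):
    theta = opt.step(theta, softmax_xent_grad(theta, label=2))

# The loss is shift invariant, so the vertical scalar is zero up to round-off
# and the trajectory should not move along the orbit.
orbit_drift = abs(float((theta - theta0) @ opt.u))
print(orbit_drift)
```

The design point the sketch isolates is the re-projection: without the line that subtracts `(h_step @ self.u) * self.u`, the per-coordinate normalisation would reintroduce a component along the orbit even when the gradient has none.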

If this is right

On language models trained past the point of fit, DDCAdam holds a validation-train loss gap of 0.67 against 5.88 for AdamW.
DDCAdam reads the dead-direction rate in 32 of 65 layer-by-observable cells where AdamW reads it in 7.
A vision transformer trained from scratch with DDCAdam reaches validation loss 1.71 against 2.12 for AdamW while compressing spare feed-forward capacity.
On a Muon base, DDCMuon groks ten of eleven seeds at depth 24 where plain Muon reaches none.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the quotient geometry is preserved, quantities from singular learning theory such as the learning coefficient may become directly estimable from the trained network without additional post-processing.
The same orbit-decomposition approach could be applied to other known symmetries in convolutional or recurrent architectures to test whether similar stabilization occurs.
Optimizer design that explicitly accounts for the symmetry group of the loss may reduce the need for manual hyper-parameter tuning that currently compensates for orbit drift.

Load-bearing premise

The four listed architectural gauges capture the relevant continuous symmetries and the orbit decomposition preconditions without introducing new artifacts or changing the effective loss landscape.

What would settle it

A direct computation of the distance between the DDC trajectory and the symmetry quotient manifold, or a controlled run in which enforcing the gauges leaves the validation-train loss gap and dead-direction detection rates unchanged.

Figures

Figures reproduced from arXiv: 2606.29176 by Tejas Pradeep Shirodkar.

**Figure 2.** The four architectural gauge groups. Each acts on parameter space and leaves the …

**Figure 3.** The DDCAdam update of Equation (1). The gradient splits along the 𝜌-orthogonal projectors into an orbit (vertical) and a horizontal component. The vertical is collapsed to one scalar per orbit dimension, normalised, and lifted back; the horizontal takes per-coordinate Adam and is re-projected to stay horizontal. The re-projection is what keeps the trajectory a gradient flow on the quotient.

**Figure 4.** The dead-direction observable suite over training separates the four optimisers, on …

**Figure 5.** Descent versus gauge on the readout, three seeds, …

**Figure 6.** The radial coordinate decides whether a load-bearing block survives, on sparse …

**Figure 7.** Weight decay or projection, on the one-block modular-addition transformer (Muon …

**Figure 8.** Matching the rotation gauge to the rotary attention geometry. Post-grok accuracy …

**Figure 9.** Reliability at depth 24, per-seed accuracy trajectories pooled across a multi-condition …

**Figure 10.** The minimum the gauge reaches, on the depth-8 grok bench with the true-MC …

**Figure 11.** Beyond grokking, a vision transformer trained from scratch on ImageNet-100 (3 …

**Figure 12.** Z-loss inflates the gauge mode it is meant to contain (grokking transformer, single …

**Figure 13.** The ReLU-rescale gauge on a two-layer teacher-student network (full-batch MSE, …

**Figure 14.** The LayerNorm-scale gauge on the synthetic block …

**Figure 15.** Gauge containment on the Pesme diagonal linear network ( …

**Figure 16.** Reading the rate at language-model scale, on the depth-12 over-trained model (3 …

**Figure 17.** The 65 layer-by-observable cells of the over-training read (depth-12 LM, seed …

**Figure 18.** Interior activation compression at depth 24, read at the smallest singular value of …

**Figure 19.** Per-seed grok trajectories for the four-arm rotation decomposition (depth-8 RoPE, …

**Figure 20.** The true-MC Fisher smallest eigenvalue over training on the depth-8 grok bench …

**Figure 21.** Dead-basis axis alignment over training at matched weight decay ( …

**Figure 22.** ViT compression dynamics over training (ImageNet-100 from scratch, matched …

**Figure 23.** The grok phenomenon on the one-block mod-113 testbed (3 seeds). Train accuracy …

**Figure 24.** Per-optimizer learning-rate by weight-decay tuning on the mod-113 grokking …

**Figure 25.** Gauge versus base optimizer across the grok grid (mod-113, relu …

**Figure 26.** Activation effective rank at each residual node on …

**Figure 27.** Keeping the load-bearing block and freezing the gauge mode are separate proper…

**Figure 28.** Step-by-step 𝐺-equivariance for the five gauge constructions. Each line is the relative deviation of a gauge-shifted parameter copy from the gauge image of the reference copy, recorded at every step of 50 DDCAdam steps fed 𝐺-related gradients (fp32). All five stay at machine precision and accumulate only the expected fp32 round-off, holding below the $10^{-6}$ bound throughout …
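
The check described in the Figure 28 caption (relative deviation of a gauge-shifted copy from the gauge image of the reference copy, recorded per step) can be reproduced in miniature for a baseline. The sketch below applies that style of measurement to plain Adam under a ReLU rescaling; it does not implement DDCAdam. It assumes a two-layer ReLU network with MSE loss, a single global scale `a` rather than a per-neuron one, float64 rather than fp32, and hypothetical helper names throughout.

```python
import numpy as np

def grads(W1, W2, X, Y):
    """MSE gradients for the network W2 @ relu(W1 @ X)."""
    H = np.maximum(W1 @ X, 0.0)
    R = W2 @ H - Y
    n = X.shape[1]
    gW2 = R @ H.T / n
    gW1 = ((W2.T @ R) * (H > 0)) @ X.T / n
    return gW1, gW2

def adam_update(g, state, t, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One plain Adam update; returns the step to subtract and the new (m, v) state."""
    m, v = state
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    step = lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return step, (m, v)

def equivariance_deviation(a, steps=50, seed=0):
    """Relative deviation of the gauge-shifted run from the gauge image of the reference run."""
    rng = np.random.default_rng(seed)
    X, Y = rng.normal(size=(4, 32)), rng.normal(size=(2, 32))
    W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(2, 8))
    V1, V2 = a * W1, W2 / a  # gauge-shifted copy: same function, different parameters
    sW1 = (np.zeros_like(W1), np.zeros_like(W1))
    sW2 = (np.zeros_like(W2), np.zeros_like(W2))
    sV1 = (np.zeros_like(W1), np.zeros_like(W1))
    sV2 = (np.zeros_like(W2), np.zeros_like(W2))
    devs = []
    for t in range(1, steps + 1):
        gW1, gW2 = grads(W1, W2, X, Y)
        gV1, gV2 = grads(V1, V2, X, Y)
        d, sW1 = adam_update(gW1, sW1, t); W1 = W1 - d
        d, sW2 = adam_update(gW2, sW2, t); W2 = W2 - d
        d, sV1 = adam_update(gV1, sV1, t); V1 = V1 - d
        d, sV2 = adam_update(gV2, sV2, t); V2 = V2 - d
        num = np.sqrt(np.sum((V1 - a * W1) ** 2) + np.sum((V2 - W2 / a) ** 2))
        den = np.sqrt(np.sum((a * W1) ** 2) + np.sum((W2 / a) ** 2))
        devs.append(num / den)
    return devs

devs_adam = equivariance_deviation(a=2.0)
print(devs_adam[0], devs_adam[-1])
```

Plain Adam's per-coordinate step size does not scale with `a`, so the shifted run separates from the gauge image of the reference run from the first step. An equivariant optimizer in the sense of the caption would keep this curve at round-off level.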

Original abstract

A deep network's loss is invariant to continuous symmetries of its parameters: the logit shift, the ReLU rescaling, the LayerNorm scale, the per-head attention rotation. Adam's per-coordinate preconditioner drifts along each symmetry orbit, which pulls the trajectory off the symmetry quotient where the optimization lives and blurs the singular-learning rate the quotient makes readable. We build DDC, a Dead-Direction Conditioner that lifts a base optimizer into a $G$-equivariant one: it conditions the optimizer's state in the orbit decomposition of a $G$-invariant metric, so the trajectory stays a preconditioned gradient flow on the quotient $\bar\Theta = \Theta/G$. The construction carries four architectural gauges (cross-entropy shift, ReLU and SwiGLU rescaling, LayerNorm and RMSNorm scale, and a per-head $O(d_{\rm head})$ attention rotation matched to RoPE), proves exactly equivariant on an Adam base, and composes with a Muon base through a gauge-equivariant orthogonaliser. Respecting the symmetry changes both the minimum the optimizer reaches and what it leaves measurable there. On a language model trained past the point of fit, DDCAdam resists the over-training collapse AdamW falls into, holding a validation-train loss gap of 0.67 against 5.88, and reads the dead-direction rate in 32 of 65 layer-by-observable cells where AdamW reads it in 7. A vision transformer trained from scratch reaches lower validation loss (1.71 against 2.12) while compressing spare feed-forward capacity a matched AdamW leaves intact. On a Muon base, where the rotation gauge composes exactly, DDCMuon groks ten of eleven seeds at depth 24 that a plain Muon never reaches. Built into the optimizer, a network's gauge symmetry sharpens the minimum it finds and turns that minimum's geometry into something the trajectory can measure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DDC builds explicit gauge-equivariant lifts of Adam and Muon for four common network symmetries, with runs showing reduced collapse and better minima, though the metric choice leaves room for unintended landscape changes.

The core contribution is a concrete construction that preconditions optimizer state via orbit decomposition of a G-invariant metric so the trajectory respects the listed symmetries exactly on an Adam base. It also composes cleanly with Muon through a gauge-equivariant orthogonaliser.

What stands out is the explicit handling of the per-head attention rotation matched to RoPE, plus the four gauges covering shift, rescaling, and norm scale. The paper states an exact equivariance proof for Adam and reports measurable differences: a 0.67 versus 5.88 validation-train gap on the language model, lower ViT loss, and higher dead-direction detection rates. On Muon it reaches grokking on ten of eleven seeds where the base does not.

The soft spot is the choice of invariant metric. Different choices can produce different horizontal lifts, and nothing in the abstract or stress-test note shows the selected metric is canonical with respect to the loss geometry. The reported minima differences are consistent with either clean quotient flow or with the preconditioner implicitly altering singular learning rates. The empirical setups are narrow, so it is unclear how far the gains travel beyond the tested models and depths.

This is for people already working on geometric views of optimization or symmetry in training dynamics. A reader who cares about making existing optimizers respect parameter invariances will find the construction usable and the claims testable.

It deserves a serious referee. The construction is new enough and the empirical claims specific enough that peer review can check the derivations and run replications.

Referee Report

3 major / 2 minor

Summary. The paper introduces Dead-Direction Conditioners (DDC), a method to lift base optimizers (Adam, Muon) into G-equivariant forms for four architectural gauge symmetries (cross-entropy shift, ReLU/SwiGLU rescaling, LayerNorm/RMSNorm scale, per-head O(d_head) rotations matched to RoPE). It preconditions optimizer state via orbit decomposition of a G-invariant metric so that trajectories remain preconditioned gradient flows on the quotient $\bar\Theta = \Theta / G$. The construction asserts exact equivariance on Adam and exact composition on Muon; empirical results show DDCAdam resisting over-training collapse (val-train gap 0.67 vs 5.88), lower ViT validation loss (1.71 vs 2.12), and DDCMuon achieving grokking on 10/11 seeds at depth 24 where plain Muon fails, while also increasing measurable dead-direction rates.

Significance. If the orbit-decomposition construction yields an exact horizontal lift without new effective potentials or altered singular learning rates, the work supplies a symmetry-respecting preconditioner that directly improves both the reached minimum and the readability of quotient geometry. The explicit gauge list, claimed exact equivariance proofs, and reproducible empirical gaps on language-model over-training and grokking constitute concrete strengths; the approach could sharpen singular-learning-rate diagnostics and symmetry-aware optimization more broadly.

major comments (3)

[§3] §3 (construction) and the abstract claim of 'preconditioned gradient flow on the quotient without new effective potentials': the specific G-invariant metric and its orbit decomposition are not shown to be canonical with respect to the loss geometry. Different invariant metrics can produce different horizontal lifts, so the reported minima (validation-train gap 0.67 vs 5.88; ViT loss 1.71 vs 2.12) are consistent with either successful quotient flow or implicit landscape modification; an explicit check that the chosen metric leaves the quotient Hessian spectrum unchanged is required.
[§4] The proof of exact equivariance on Adam (abstract and §4) and the Muon orthogonaliser composition: the derivation details, including how the four gauges are lifted and how the preconditioner state is updated under the orbit decomposition, are not visible in the provided text, leaving the central 'exact equivariance' claim only partially verifiable.
[Empirical section] Table reporting dead-direction rates (32/65 cells for DDCAdam vs 7/65 for AdamW) and grokking counts: these are load-bearing for the claim that respecting symmetry 'turns that minimum's geometry into something the trajectory can measure'; the layer-by-observable breakdown and seed statistics must be accompanied by controls confirming that the increase is not an artifact of the metric choice altering effective learning rates on the quotient.
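
The first major comment asks for a check on the quotient Hessian spectrum. As a toy illustration of what such a check looks like, the sketch below uses the softmax cross-entropy Hessian with respect to the logits, $H = \mathrm{diag}(p) - pp^\top$, whose null direction is the logit-shift orbit. It restricts $H$ to the Euclidean-horizontal subspace and compares spectra. This is an assumption-laden sketch on a single example with the Euclidean metric, not the check the referee would need on a full network; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=6)
p = np.exp(logits - logits.max())
p /= p.sum()

# Hessian of softmax cross-entropy with respect to the logits.
H = np.diag(p) - np.outer(p, p)

# The logit-shift orbit direction lies in the null space of H.
ones = np.ones_like(p)
null_residual = np.linalg.norm(H @ ones)

# Orthonormal basis of the horizontal space (orthogonal complement of `ones`):
# QR of [ones | I] puts the normalised orbit direction in the first column of Q.
Q_full, _ = np.linalg.qr(np.column_stack([ones, np.eye(p.size)]))
Q = Q_full[:, 1:]

# Hessian restricted to the horizontal space, a stand-in for the quotient Hessian.
H_quot = Q.T @ H @ Q

full_spec = np.linalg.eigvalsh(H)       # ascending; first entry is the gauge zero mode
quot_spec = np.linalg.eigvalsh(H_quot)  # should match the remaining eigenvalues
print(full_spec, quot_spec)
```

Under the Euclidean metric the restricted spectrum equals the full spectrum with the gauge zero mode removed. The referee's concern is whether a different invariant metric, or the preconditioner built from it, would change the spectrum that is effectively seen on the quotient.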

minor comments (2)

[§2] Notation for the quotient $\bar\Theta = \Theta / G$ and the horizontal lift should be introduced with a short diagram or explicit coordinate chart in the first section where the orbit decomposition appears.
[Abstract / §3] The four gauges are listed in the abstract; a compact table mapping each gauge to its group action and the corresponding invariant metric component would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate planned revisions.

Referee: [§3] §3 (construction) and the abstract claim of 'preconditioned gradient flow on the quotient without new effective potentials': the specific G-invariant metric and its orbit decomposition are not shown to be canonical with respect to the loss geometry. Different invariant metrics can produce different horizontal lifts, so the reported minima (validation-train gap 0.67 vs 5.88; ViT loss 1.71 vs 2.12) are consistent with either successful quotient flow or implicit landscape modification; an explicit check that the chosen metric leaves the quotient Hessian spectrum unchanged is required.

Authors: The Euclidean metric on parameter space is the canonical choice compatible with Adam's coordinate-wise structure and the architectural gauges; the orbit decomposition is constructed precisely so that the horizontal lift reproduces the base optimizer's preconditioned flow projected onto the quotient without adding effective potentials. While other invariant metrics could produce different lifts, the reported improvements are tied to this standard choice. We agree an explicit argument or numerical check confirming the quotient Hessian spectrum is unchanged would strengthen the claim and will add it to a revised §3. revision: yes
Referee: [§4] The proof of exact equivariance on Adam (abstract and §4) and the Muon orthogonaliser composition: the derivation details, including how the four gauges are lifted and how the preconditioner state is updated under the orbit decomposition, are not visible in the provided text, leaving the central 'exact equivariance' claim only partially verifiable.

Authors: The full derivations for lifting each gauge (cross-entropy shift, ReLU/SwiGLU rescaling, LayerNorm/RMSNorm scale, per-head O(d_head) rotation) and the state-update rules under orbit decomposition appear in Appendix B; the main-text §4 summarizes the resulting equivariance statements. We will expand the main-text presentation of the key lifting steps and Muon composition to make the proof self-contained without requiring the appendix. revision: yes
Referee: [Empirical section] Table reporting dead-direction rates (32/65 cells for DDCAdam vs 7/65 for AdamW) and grokking counts: these are load-bearing for the claim that respecting symmetry 'turns that minimum's geometry into something the trajectory can measure'; the layer-by-observable breakdown and seed statistics must be accompanied by controls confirming that the increase is not an artifact of the metric choice altering effective learning rates on the quotient.

Authors: We will augment the empirical section with controls that match effective learning rates on the quotient (via rescaled step-size ablations) and compare against non-equivariant runs that apply the same metric without orbit decomposition, thereby isolating the contribution of gauge-equivariant preconditioning from possible learning-rate side effects. revision: yes

Circularity Check

0 steps flagged

No circularity: explicit construction with independent empirical results

The paper presents DDC as an explicit mathematical construction that lifts base optimizers (Adam, Muon) into G-equivariant forms via orbit decomposition of a G-invariant metric on four specified architectural gauges, with a claimed proof of exact equivariance. The central claims rest on this construction and on reported empirical outcomes (validation-train gaps, dead-direction detection rates, grokking seeds, ViT losses) rather than any reduction of those outcomes to quantities fitted inside the same experiment. No self-citations, self-definitional loops, fitted-input predictions, or smuggled ansatzes appear in the provided text that would make the derivation equivalent to its inputs by construction. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of the listed parameter symmetries and the technical feasibility of an orbit-based G-invariant metric for preconditioning.

axioms (2)

domain assumption The network loss is invariant under the four listed continuous parameter symmetries (logit shift, ReLU/SwiGLU rescaling, LayerNorm/RMSNorm scale, per-head attention rotation).
Stated as the starting point for the construction in the abstract.
domain assumption A G-invariant metric exists whose orbit decomposition yields a well-defined preconditioner on the quotient.
Required for the equivariant lift described in the abstract.
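
The first axiom can be spot-checked numerically for two of the listed symmetries. The sketch below verifies the per-neuron ReLU rescaling on a two-layer network and the scale invariance of an RMSNorm applied to a linear layer's output. It is a minimal illustration under stated assumptions: no biases, an RMSNorm without the usual epsilon term (real implementations add one, which makes the invariance approximate), and illustrative shapes and names.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16))   # 16 inputs of dimension 4
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))

def relu_net(W1, W2, X):
    """Two-layer ReLU network without biases."""
    return W2 @ np.maximum(W1 @ X, 0.0)

# ReLU rescaling gauge: scale each hidden neuron's incoming weights by a_i > 0
# and its outgoing weights by 1 / a_i.
a = rng.uniform(0.5, 2.0, size=8)
out_ref = relu_net(W1, W2, X)
out_gauge = relu_net(a[:, None] * W1, W2 / a[None, :], X)
relu_gap = np.max(np.abs(out_ref - out_gauge))

def rmsnorm(Z, gamma):
    """RMSNorm over the feature axis, written without an epsilon term."""
    return gamma[:, None] * Z / np.sqrt(np.mean(Z ** 2, axis=0, keepdims=True))

# Norm-scale gauge: rescaling the weights that feed the norm leaves its output unchanged.
gamma = rng.normal(size=8)
norm_ref = rmsnorm(W1 @ X, gamma)
norm_gauge = rmsnorm((3.0 * W1) @ X, gamma)
rms_gap = np.max(np.abs(norm_ref - norm_gauge))

print(relu_gap, rms_gap)
```

Both gaps sit at floating-point round-off, which is what the invariance assumption predicts for these two gauges in this simplified setting.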

Reference graph

Works this paper leans on

39 extracted references · 19 canonical work pages · 9 internal anchors

[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008
[2] S. Arora, N. Cohen, and E. Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In ICML, 2018. URL https://arxiv.org/abs/1802.06509
[3] S. Ashkboos, A. Mohtashami, M. L. Croci, et al. QuaRot: Outlier-free 4-bit inference in rotated LLMs. In NeurIPS, 2024
[4] G. Bécigneul and O.-E. Ganea. Riemannian adaptive optimization methods. In ICLR, 2019. URL https://arxiv.org/abs/1810.00760
[5] J. Bernstein and L. Newhouse. Modular duality in deep learning, 2024. arXiv:2410.21265
[6] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970
[7] A. de Brébisson and P. Vincent. The Z-loss: A shift and scale invariant classification loss belonging to the spherical family. arXiv preprint arXiv:1604.08859, 2016
[8] M. F. DePavia, V. Charisopoulos, and R. Willett. How do simple rotations affect the implicit bias of Adam? arXiv preprint arXiv:2510.23804, 2025
[9] O. Filatov, J. Wang, J. Ebert, and S. Kesselheim. Optimal scaling needs optimal norm, 2025. arXiv:2510.03871
[10] V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. In ICML, 2018
[11] X. Hu et al. OSTQuant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting. In ICLR, 2025
[12] K. Jordan, Y. Jin, V. Boza, et al. Muon: An optimizer for hidden layers in neural networks, 2024. Online manuscript
[13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015
[14] A. Kosson, B. Messmer, and M. Jaggi. Rotational equilibrium: How weight decay balances learning across neural networks. arXiv preprint arXiv:2305.17212, 2024
[15] D. Kunin, J. Sagastuy-Brena, S. Ganguli, D. L. K. Yamins, and H. Tanaka. Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics. In ICLR, 2021
[16] E. Lau, Z. Furman, G. Wang, D. Murfet, and S. Wei. The local learning coefficient: A singularity-aware complexity measure. In AISTATS, 2025
[17] T. T.-K. Lau and W. J. Su. A symmetry-compatible principle for optimizer design: Embeddings, LM heads, SwiGLU MLPs, and MoE routers, 2026. arXiv:2605.18106
[18] Q. Li, C. Tai, and W. E. Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations. Journal of Machine Learning Research, 20(40):1–47, 2019. URL https://jmlr.org/papers/v20/17-526.html
[19] Z. Li, L. Liu, C. Liang, W. Chen, and T. Zhao. NorMuon: Making Muon more efficient and scalable, 2025. URL https://arxiv.org/abs/2510.05491
[20] S. Z. Ling, N. Sharp, and A. Jacobson. VectorAdam for rotation equivariant geometry optimization. In NeurIPS, 2022. URL https://arxiv.org/abs/2205.13599
[21] Z. Liu, C. Zhao, I. Fedorov, et al. SpinQuant: LLM quantization with learned rotations. In ICLR, 2024
[22] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In ICLR, 2019
[23] J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML, 2015
[24] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. In ICLR, 2023
[25] S. Pesme, L. Pillaud-Vivien, and N. Flammarion. Implicit bias of SGD for diagonal linear networks: A provable benefit of stochasticity. In NeurIPS, 2021
[26] T. Pethick et al. Training deep learning models with norm-constrained LMOs, 2025. arXiv:2502.07529; the Scion optimizer
[27] L. Prieto, M. Barsbey, P. A. M. Mediano, and T. Birdal. Grokking at the edge of numerical stability. In ICLR, 2025. URL https://arxiv.org/abs/2501.04697
[28] T. P. Shirodkar. Dead directions: Geometric singular learning, 2026. URL https://arxiv.org/abs/2606.05957
[29] T. P. Shirodkar and P. J. Narayanan. Dead-direction signatures: A cheap spectral reading of singular complexity, 2026a. URL https://arxiv.org/abs/2606.21158
[30] T. P. Shirodkar and P. J. Narayanan. Algebraic dead directions in LayerNorm transformers: A forward-pass-only diagnostic at LLM scale, 2026b. URL https://arxiv.org/abs/2606.19491
[31] E. Silverstein, D. Kunin, and V. Shyam. Symmetry breaking in transformers for efficient and interpretable training, 2026. URL https://arxiv.org/abs/2601.22257
[32] H. Tanaka and D. Kunin. Noether's learning dynamics: Role of symmetry breaking in neural networks. In NeurIPS, 2021
[33] T. van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017
[34] N. Vyas, D. Morwani, R. Zhao, I. Shapira, D. Brandfonbrener, L. Janson, and S. M. Kakade. SOAP: Improving and stabilizing Shampoo using Adam. In NeurIPS, 2024
[35] R. Wan, Z. Zhu, X. Zhang, and J. Sun. Spherical motion dynamics: Learning dynamics of normalized neural network using SGD and weight decay. In NeurIPS, 2021
[36] S. Watanabe. Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, 2009
[37] S. Wei, D. Murfet, M. Gong, H. Li, J. Gell-Redman, and T. Quella. Deep learning is singular, and that's good. IEEE Transactions on Neural Networks and Learning Systems, 34(12):10473–10486, 2022
[38] J.-N. Yen, S. Si, Z. Meng, F. Yu, S. S. Duvvuri, I. S. Dhillon, C.-J. Hsieh, and S. Kumar. LoRA done RITE: Robust invariant transformation equilibration for LoRA optimization. In ICLR, 2025. arXiv:2410.20625
[39] B. Zhao, R. Walters, and R. Yu. Symmetry in neural network parameter spaces, 2025. arXiv:2506.13018
