Dead-Direction Conditioners: Gauge-Equivariant Preconditioning for Deep Networks
Pith reviewed 2026-06-30 07:56 UTC · model grok-4.3
The pith
Making optimizers gauge-equivariant by conditioning on symmetry orbits keeps trajectories on the loss quotient and changes the minimum reached.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A deep network's loss is invariant to continuous symmetries of its parameters: the logit shift, the ReLU rescaling, the LayerNorm scale, the per-head attention rotation. Adam's per-coordinate preconditioner drifts along each symmetry orbit, which pulls the trajectory off the symmetry quotient where the optimization lives and blurs the singular-learning rate the quotient makes readable. We build DDC, a Dead-Direction Conditioner that lifts a base optimizer into a G-equivariant one by conditioning the optimizer's state in the orbit decomposition of a G-invariant metric, so the trajectory stays a preconditioned gradient flow on the quotient. The construction carries four architectural gauges, p
What carries the argument
The Dead-Direction Conditioner, which preconditions optimizer state in the orbit decomposition of a G-invariant metric to enforce G-equivariance and keep the trajectory on the parameter quotient.
If this is right
- On language models trained past the point of fit, DDCAdam holds a validation-train loss gap of 0.67 against 5.88 for AdamW.
- DDCAdam reads the dead-direction rate in 32 of 65 layer-by-observable cells where AdamW reads it in 7.
- A vision transformer trained from scratch with DDCAdam reaches validation loss 1.71 against 2.12 for AdamW while compressing spare feed-forward capacity.
- On a Muon base, DDCMuon groks ten of eleven seeds at depth 24 where plain Muon reaches none.
Where Pith is reading between the lines
- If the quotient geometry is preserved, quantities from singular learning theory such as the learning coefficient may become directly estimable from the trained network without additional post-processing.
- The same orbit-decomposition approach could be applied to other known symmetries in convolutional or recurrent architectures to test whether similar stabilization occurs.
- Optimizer design that explicitly accounts for the symmetry group of the loss may reduce the need for manual hyper-parameter tuning that currently compensates for orbit drift.
Load-bearing premise
The four listed architectural gauges capture the relevant continuous symmetries and the orbit decomposition preconditions without introducing new artifacts or changing the effective loss landscape.
What would settle it
A direct computation of the distance between the DDC trajectory and the symmetry quotient manifold, or a controlled run in which enforcing the gauges leaves the validation-train loss gap and dead-direction detection rates unchanged.
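One way such a check could be set up is sketched below. This is a hypothetical diagnostic, not the paper's procedure: the helper name `vertical_fraction` is invented here, and the projection assumes the Euclidean inner product, which is itself one of the metric choices the referee questions. Given an optimizer update and a matrix whose columns span the orbit tangent directions at the current parameters, it reports the share of the update's squared norm that lies along the orbits.

```python
import numpy as np

def vertical_fraction(update, tangents):
    """Share of the squared norm of `update` lying in the column span of `tangents`.

    Assumption: Euclidean inner product; `tangents` has one column per orbit direction.
    """
    # Least-squares coefficients give the orthogonal projection onto the span.
    coef, *_ = np.linalg.lstsq(tangents, update, rcond=None)
    vertical = tangents @ coef
    return float(vertical @ vertical / (update @ update))

# Toy check in R^3 with a single orbit direction along the first axis.
tangents = np.array([[1.0], [0.0], [0.0]])
frac_mixed = vertical_fraction(np.array([3.0, 4.0, 0.0]), tangents)
frac_horizontal = vertical_fraction(np.array([0.0, 1.0, 0.0]), tangents)
print(frac_mixed, frac_horizontal)
```

In the toy check the first update has squared norm 25 with 9 of it along the orbit direction, and the second is purely horizontal. Logged per step for both optimizers, a quantity of this kind would show whether an update rule moves along the orbits at all.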
Original abstract
A deep network's loss is invariant to continuous symmetries of its parameters: the logit shift, the ReLU rescaling, the LayerNorm scale, the per-head attention rotation. Adam's per-coordinate preconditioner drifts along each symmetry orbit, which pulls the trajectory off the symmetry quotient where the optimization lives and blurs the singular-learning rate the quotient makes readable. We build DDC, a Dead-Direction Conditioner that lifts a base optimizer into a $G$-equivariant one: it conditions the optimizer's state in the orbit decomposition of a $G$-invariant metric, so the trajectory stays a preconditioned gradient flow on the quotient $\bar\Theta = \Theta/G$. The construction carries four architectural gauges (cross-entropy shift, ReLU and SwiGLU rescaling, LayerNorm and RMSNorm scale, and a per-head $O(d_{\rm head})$ attention rotation matched to RoPE), proves exactly equivariant on an Adam base, and composes with a Muon base through a gauge-equivariant orthogonaliser. Respecting the symmetry changes both the minimum the optimizer reaches and what it leaves measurable there. On a language model trained past the point of fit, DDCAdam resists the over-training collapse AdamW falls into, holding a validation-train loss gap of 0.67 against 5.88, and reads the dead-direction rate in 32 of 65 layer-by-observable cells where AdamW reads it in 7. A vision transformer trained from scratch reaches lower validation loss (1.71 against 2.12) while compressing spare feed-forward capacity a matched AdamW leaves intact. On a Muon base, where the rotation gauge composes exactly, DDCMuon groks ten of eleven seeds at depth 24 that a plain Muon never reaches. Built into the optimizer, a network's gauge symmetry sharpens the minimum it finds and turns that minimum's geometry into something the trajectory can measure.
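The drift claim in the abstract can be illustrated on a toy model. The sketch below is not the paper's code: it assumes a two-layer ReLU regressor on random data, and the names `loss_and_grads` and `orbit_component` are hypothetical. It checks that the loss is unchanged by per-unit rescaling, that the plain gradient has no component along the rescaling orbit, and that the first Adam step direction generally does.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))   # toy inputs (assumed data, not from the paper)
y = rng.normal(size=32)        # toy regression targets
W1 = rng.normal(size=(4, 5))   # incoming weights of 4 ReLU units
w2 = rng.normal(size=4)        # outgoing weights

def loss_and_grads(W1, w2):
    """Mean-squared-error loss of w2 . relu(W1 x) and its gradients."""
    H = X @ W1.T                       # pre-activations, shape (32, 4)
    A = np.maximum(H, 0.0)             # ReLU activations
    r = A @ w2 - y                     # residuals
    loss = 0.5 * np.mean(r ** 2)
    g_w2 = A.T @ r / len(y)
    g_W1 = (r[:, None] * (H > 0) * w2).T @ X / len(y)
    return loss, g_W1, g_w2

def orbit_component(d_W1, d_w2):
    """Inner product of a direction with the rescaling generator (W1[i], -w2[i]), per unit."""
    return np.sum(d_W1 * W1, axis=1) - d_w2 * w2

loss, g_W1, g_w2 = loss_and_grads(W1, w2)

# Gauge transform: scale unit i's incoming weights by a[i], outgoing by 1 / a[i].
a = np.array([2.0, 0.5, 3.0, 1.5])
loss_gauged, _, _ = loss_and_grads(a[:, None] * W1, w2 / a)

# The first bias-corrected Adam step direction is g / (|g| + eps), elementwise.
eps = 1e-8
adam_W1 = g_W1 / (np.abs(g_W1) + eps)
adam_w2 = g_w2 / (np.abs(g_w2) + eps)

grad_along_orbit = orbit_component(g_W1, g_w2)
adam_along_orbit = orbit_component(adam_W1, adam_w2)
print(loss, loss_gauged)
print(grad_along_orbit)
print(adam_along_orbit)
```

This is a one-step illustration of the ReLU rescaling gauge only. It shows why a per-coordinate preconditioner can move along an orbit that the raw gradient never touches, and says nothing about how DDC itself removes that motion.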
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Dead-Direction Conditioners (DDC), a method to lift base optimizers (Adam, Muon) into G-equivariant forms for four architectural gauge symmetries (cross-entropy shift, ReLU/SwiGLU rescaling, LayerNorm/RMSNorm scale, per-head O(d_head) rotations matched to RoPE). It preconditions optimizer state via orbit decomposition of a G-invariant metric so that trajectories remain preconditioned gradient flows on the quotient $\bar\Theta = \Theta/G$. The construction asserts exact equivariance on Adam and exact composition on Muon; empirical results show DDCAdam resisting over-training collapse (val-train gap 0.67 vs 5.88), lower ViT validation loss (1.71 vs 2.12), and DDCMuon achieving grokking on 10/11 seeds at depth 24 where plain Muon fails, while also increasing measurable dead-direction rates.
Significance. If the orbit-decomposition construction yields an exact horizontal lift without new effective potentials or altered singular learning rates, the work supplies a symmetry-respecting preconditioner that directly improves both the reached minimum and the readability of quotient geometry. The explicit gauge list, claimed exact equivariance proofs, and reproducible empirical gaps on language-model over-training and grokking constitute concrete strengths; the approach could sharpen singular-learning-rate diagnostics and symmetry-aware optimization more broadly.
major comments (3)
- [§3] §3 (construction) and the abstract claim of 'preconditioned gradient flow on the quotient without new effective potentials': the specific G-invariant metric and its orbit decomposition are not shown to be canonical with respect to the loss geometry. Different invariant metrics can produce different horizontal lifts, so the reported minima (validation-train gap 0.67 vs 5.88; ViT loss 1.71 vs 2.12) are consistent with either successful quotient flow or implicit landscape modification; an explicit check that the chosen metric leaves the quotient Hessian spectrum unchanged is required.
- [§4] The proof of exact equivariance on Adam (abstract and §4) and the Muon orthogonaliser composition: the derivation details, including how the four gauges are lifted and how the preconditioner state is updated under the orbit decomposition, are not visible in the provided text, leaving the central 'exact equivariance' claim only partially verifiable.
- [Empirical section] Table reporting dead-direction rates (32/65 cells for DDCAdam vs 7/65 for AdamW) and grokking counts: these are load-bearing for the claim that respecting symmetry 'turns that minimum's geometry into something the trajectory can measure'; the layer-by-observable breakdown and seed statistics must be accompanied by controls confirming that the increase is not an artifact of the metric choice altering effective learning rates on the quotient.
minor comments (2)
- [§2] Notation for the quotient $\bar\Theta = \Theta/G$ and the horizontal lift should be introduced with a short diagram or explicit coordinate chart in the first section where the orbit decomposition appears.
- [Abstract / §3] The four gauges are listed in the abstract; a compact table mapping each gauge to its group action and the corresponding invariant metric component would improve readability.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below and indicate planned revisions.
-
Referee: [§3] §3 (construction) and the abstract claim of 'preconditioned gradient flow on the quotient without new effective potentials': the specific G-invariant metric and its orbit decomposition are not shown to be canonical with respect to the loss geometry. Different invariant metrics can produce different horizontal lifts, so the reported minima (validation-train gap 0.67 vs 5.88; ViT loss 1.71 vs 2.12) are consistent with either successful quotient flow or implicit landscape modification; an explicit check that the chosen metric leaves the quotient Hessian spectrum unchanged is required.
Authors: The Euclidean metric on parameter space is the canonical choice compatible with Adam's coordinate-wise structure and the architectural gauges; the orbit decomposition is constructed precisely so that the horizontal lift reproduces the base optimizer's preconditioned flow projected onto the quotient without adding effective potentials. While other invariant metrics could produce different lifts, the reported improvements are tied to this standard choice. We agree an explicit argument or numerical check confirming the quotient Hessian spectrum is unchanged would strengthen the claim and will add it to a revised §3. revision: yes
-
Referee: [§4] The proof of exact equivariance on Adam (abstract and §4) and the Muon orthogonaliser composition: the derivation details, including how the four gauges are lifted and how the preconditioner state is updated under the orbit decomposition, are not visible in the provided text, leaving the central 'exact equivariance' claim only partially verifiable.
Authors: The full derivations for lifting each gauge (cross-entropy shift, ReLU/SwiGLU rescaling, LayerNorm/RMSNorm scale, per-head O(d_head) rotation) and the state-update rules under orbit decomposition appear in Appendix B; the main-text §4 summarizes the resulting equivariance statements. We will expand the main-text presentation of the key lifting steps and Muon composition to make the proof self-contained without requiring the appendix. revision: yes
-
Referee: [Empirical section] Table reporting dead-direction rates (32/65 cells for DDCAdam vs 7/65 for AdamW) and grokking counts: these are load-bearing for the claim that respecting symmetry 'turns that minimum's geometry into something the trajectory can measure'; the layer-by-observable breakdown and seed statistics must be accompanied by controls confirming that the increase is not an artifact of the metric choice altering effective learning rates on the quotient.
Authors: We will augment the empirical section with controls that match effective learning rates on the quotient (via rescaled step-size ablations) and compare against non-equivariant runs that apply the same metric without orbit decomposition, thereby isolating the contribution of gauge-equivariant preconditioning from possible learning-rate side effects. revision: yes
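The exchange about horizontal lifts rests on a standard fact that can be written out in one line. The derivation below is a sketch under stated assumptions (a Lie group G acting smoothly on the parameters, the Euclidean inner product defining the gradient) and is not taken from the paper: differentiating the invariance of the loss along any one-parameter subgroup shows the gradient is orthogonal to the orbit, while a diagonal preconditioner does not inherit that property in general.

```latex
% Assumptions: G is a Lie group acting smoothly on \Theta, \xi lies in its Lie algebra,
% and \xi_\Theta(\theta) denotes the induced tangent vector to the orbit at \theta.
\[
L(g\cdot\theta) = L(\theta)\quad \forall g \in G
\;\Longrightarrow\;
0 = \left.\frac{d}{dt}\right|_{t=0} L\big(\exp(t\xi)\cdot\theta\big)
  = \big\langle \nabla L(\theta),\, \xi_\Theta(\theta) \big\rangle .
\]
% For a diagonal preconditioner P = \mathrm{diag}(p_1,\dots,p_n) the same pairing becomes
\[
\big\langle P\,\nabla L(\theta),\, \xi_\Theta(\theta) \big\rangle
  = \sum_{i} p_i\, \partial_i L(\theta)\, \big(\xi_\Theta(\theta)\big)_i ,
\]
% which need not vanish unless, for example, the p_i are all equal.
```

The second expression is the quantity a control experiment would need to track, since it is the part of a preconditioned step that the invariance argument no longer forces to zero.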
Circularity Check
No circularity: explicit construction with independent empirical results
The paper presents DDC as an explicit mathematical construction that lifts base optimizers (Adam, Muon) into G-equivariant forms via orbit decomposition of a G-invariant metric on four specified architectural gauges, with a claimed proof of exact equivariance. The central claims rest on this construction and on reported empirical outcomes (validation-train gaps, dead-direction detection rates, grokking seeds, ViT losses) rather than any reduction of those outcomes to quantities fitted inside the same experiment. No self-citations, self-definitional loops, fitted-input predictions, or smuggled ansatzes appear in the provided text that would make the derivation equivalent to its inputs by construction. The method is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The network loss is invariant under the four listed continuous parameter symmetries (logit shift, ReLU/SwiGLU rescaling, LayerNorm/RMSNorm scale, per-head attention rotation).
- domain assumption A G-invariant metric exists whose orbit decomposition yields a well-defined preconditioner on the quotient.
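The first axiom can be spot-checked numerically for two of the listed symmetries. The sketch below is illustrative rather than the paper's test: it uses random toy tensors, the helper `cross_entropy` is written here for the purpose, and it assumes attention scores without RoPE, so it does not cover the RoPE-matched restriction on the rotation that the abstract mentions.

```python
import numpy as np

rng = np.random.default_rng(1)

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example, with max-subtraction for stability."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[target])

# Logit shift: adding the same constant to every logit leaves the loss unchanged.
logits = rng.normal(size=10)
ce_plain = cross_entropy(logits, 3)
ce_shift = cross_entropy(logits + 7.5, 3)

# Per-head rotation: rotating queries and keys by the same orthogonal R
# leaves the attention scores Q K^T unchanged (RoPE omitted, by assumption).
d_model, d_head, T = 16, 8, 6
Xs = rng.normal(size=(T, d_model))
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
R, _ = np.linalg.qr(rng.normal(size=(d_head, d_head)))  # random orthogonal matrix
scores = (Xs @ W_q) @ (Xs @ W_k).T
scores_rot = (Xs @ W_q @ R) @ (Xs @ W_k @ R).T

print(ce_plain, ce_shift)
print(np.abs(scores - scores_rot).max())
```

Both checks hold to floating-point precision on this toy setup. Whether the same invariances survive in a full model, with biases, normalisation epsilons, and RoPE in place, is what the axiom asks the reader to take on trust.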