
arXiv: 2605.18106 · v1 · pith:AGW62PWNnew · submitted 2026-05-18 · 🧮 math.OC · cs.AI · cs.LG · stat.ML

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Pith reviewed 2026-05-20 09:30 UTC · model grok-4.3

classification 🧮 math.OC · cs.AI · cs.LG · stat.ML
keywords symmetry-compatible optimizer · equivariant updates · embedding matrices · MoE routers · SwiGLU projections · language model pretraining · permutation symmetry · AdamW alternatives

The pith

Gradient updates should be equivariant under the symmetry group of each weight block to improve training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a symmetry-compatible principle stating that the gradient update rule must respect the equivariance properties of the parameter space in neural networks. This leads to tailored optimizers for blocks with permutation symmetries, such as embeddings and LM heads, and shared-shift symmetries, such as MoE routers, along with hybrid rules for SwiGLU projections. The authors assemble these into a full layerwise optimizer stack and test it in pre-training of both dense and sparse MoE language models. Experiments show consistent reductions in final validation loss compared with AdamW, plus stability gains in some runs. A sympathetic reader would see this as closing a long-standing mismatch between architectural geometry and coordinate-wise optimization practice.

Core claim

The central claim is that the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block, yielding one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates for embeddings, LM heads, SwiGLU MLPs, and MoE routers that together outperform standard AdamW updates.

What carries the argument

The symmetry-compatible principle, requiring that the update rule be equivariant under the symmetry group of each weight block.
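
One way to write the principle in symbols is sketched below. The notation is an assumption made for illustration, since this summary does not reproduce the paper's formulas: a group 𝒢 acts on a weight block W ∈ R^{m×n}, G denotes the gradient with respect to W, and Φ is the map from the gradient to the update direction.

```latex
% Equivariance of the update map \Phi under the group action (illustrative notation).
\Phi(g \cdot G) \;=\; g \cdot \Phi(G) \qquad \text{for all } g \in \mathcal{G}.
% Example: for an action by left and right multiplication, G \mapsto P G Q^{\top},
\Phi(P G Q^{\top}) \;=\; P \, \Phi(G) \, Q^{\top},
% with P, Q orthogonal in the bi-orthogonal case, or P a permutation matrix
% when only row permutations are symmetries.
```

Read this way, transforming the gradient and then applying the update rule gives the same result as applying the update rule and then transforming.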

If this is right

  • Symmetry-compatible updates improve final validation loss over AdamW across dense and MoE language models.
  • The updates also improve training stability in several of the reported experiments.
  • The constructions form a complete end-to-end layerwise optimizer stack that assigns each major matrix parameter class an update matching its symmetry group.
  • The new rules include row-norm, one-sided spectral, and hybrid variants derived from permutation and shared-shift symmetries.
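
The shared-shift symmetry mentioned in the last item is not defined in this summary. One plausible reading for a softmax router, offered here as an assumption rather than as the paper's definition, is that adding the same vector to every row of the router matrix shifts all expert logits by a common constant and so leaves the routing probabilities unchanged. The sketch below checks that reading numerically on a toy cross-entropy loss; `W`, `x`, `c`, and `y` are hypothetical placeholders.

```python
import numpy as np

def softmax(z):
    """Numerically stabilized softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
num_experts, d_model = 4, 6
W = rng.standard_normal((num_experts, d_model))  # hypothetical router matrix
x = rng.standard_normal(d_model)                 # one token representation
c = rng.standard_normal(d_model)                 # arbitrary shared shift

# Adding the same row vector c to every row shifts every logit by c @ x.
W_shifted = W + np.outer(np.ones(num_experts), c)
probs = softmax(W @ x)
probs_shifted = softmax(W_shifted @ x)
shift_invariant = np.allclose(probs, probs_shifted)

# Toy loss: cross-entropy against a target expert y, whose gradient
# with respect to W is (p - e_y) x^T.
y = 2
grad = np.outer(probs - np.eye(num_experts)[y], x)

# The gradient has no component along the shift directions:
# each column of grad sums to zero (up to rounding).
column_sums = grad.sum(axis=0)
```

Under this reading, the gradient always lies in the subspace of matrices with zero column sums, so a rule that keeps updates in that subspace never moves the router along directions the loss cannot see.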

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same principle could be applied to other layer types whose symmetries have not yet been catalogued.
  • Architectures might eventually be co-designed so that their parameter symmetries align more cleanly with efficient equivariant updates.
  • Checking whether the loss gains persist or reverse when models are scaled by another order of magnitude would test the robustness of the reported improvements.

Load-bearing premise

That enforcing equivariance under the identified symmetry groups for these parameter blocks is sufficient to produce better optimizers without hidden costs that only appear at larger scale.

What would settle it

A controlled pre-training run on any of the tested model families in which the symmetry-compatible stack produces equal or higher validation loss than AdamW at the same compute budget would falsify the practical advantage.

Figures

Figures reproduced from arXiv: 2605.18106 by Tim Tsz-Kit Lau, Weijie Su.

Figure 1: Two perspectives on deep learning optimization. Left: coordinate-wise adaptive methods …
Figure 2: Validation losses for downsized gpt-oss pre-training. The configurations differ in the opti…
Figure 3: Training and validation losses for Qwen3-0.6B-style pre-training. In each subfigure, the …
Figure 4: Training and validation losses for Gemma 3 1B-style pre-training. In each subfigure, the …
Figure 5: Training and validation losses for OLMoE-1B-7B-style pre-training. The configurations differ in the optimizers assigned to the embedding, LM head, and router matrices. The final validation losses for configurations (i)–(iv) are 4.0814, 4.0717, 4.1083, and 4.1155 respectively. As shown in …
Figure 6: Training and validation losses for downsized gpt-oss pre-training. The configurations …
Original abstract

A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible updates consistently improve final validation loss, and in several cases training stability, over corresponding AdamW updates.
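
To make the bi-orthogonal case in the abstract concrete, here is a minimal NumPy sketch. It uses the orthogonal polar factor of the gradient, computed by an exact SVD, as a stand-in for the idealized update that spectral methods of the Muon family approximate. The function names and the random test matrices are hypothetical and are not taken from the paper.

```python
import numpy as np

def polar_factor(grad):
    """Orthogonal polar factor U @ Vt of grad = U @ diag(s) @ Vt."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

def random_orthogonal(n, rng):
    """Orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 3))   # stand-in gradient (full column rank)
P = random_orthogonal(4, rng)
Q = random_orthogonal(3, rng)

# Transform then update, versus update then transform.
lhs = polar_factor(P @ grad @ Q.T)
rhs = P @ polar_factor(grad) @ Q.T
polar_is_equivariant = np.allclose(lhs, rhs)

# A coordinate-wise rule such as sign(grad) generally fails the same check.
sign_is_equivariant = np.allclose(
    np.sign(P @ grad @ Q.T), P @ np.sign(grad) @ Q.T
)
```

The polar factor passes the check because rotating the gradient rotates its singular vectors and leaves its singular values alone. The sign map depends on the coordinate system, so it does not commute with the rotations.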

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a symmetry-compatible principle for optimizer design: the gradient update should be equivariant under the symmetry group acting on each weight block. It unifies bi-orthogonally equivariant methods for general matrix layers and derives new updates (one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, left-spectral) for blocks whose symmetries are permutations or shared shifts rather than orthogonal transformations: embeddings, LM heads, SwiGLU projections, and MoE routers. These are assembled into a layerwise optimizer stack and tested in pre-training runs on Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss models, where the symmetry-compatible updates are reported to yield lower final validation loss and, in several cases, better training stability than AdamW.

Significance. If the performance gains are shown to arise specifically from the enforced equivariance (rather than incidental rescaling), the principle supplies a systematic route to architecture-aware optimizers that respect the permutation and shift structures already present in modern LLM components. The multi-architecture experimental sweep and the unified treatment of bi-orthogonal updates constitute concrete strengths that would remain valuable even if the central empirical claim requires additional controls.

major comments (2)
  1. [Experimental section] The manuscript reports consistent validation-loss improvements for the symmetry-compatible updates (row-norm, one-sided spectral, hybrid, centered row-norm, left-spectral) over AdamW across Qwen3-0.6B, Gemma-3-1B, OLMoE and gpt-oss runs, yet provides no per-method learning-rate sweeps or explicit matching of mean update norms / effective step lengths. Because the proposed rules embed explicit row/column normalizations and spectral projections absent from coordinate-wise AdamW, the observed gains could be produced by an implicit change in gradient magnitude rather than by permutation or shared-shift equivariance. This control is load-bearing for the central claim.
  2. [§3 (Derivations of symmetry-compatible updates)] The principle is introduced as an independent design axiom whose justification rests on external validation by the loss curves. It would strengthen the manuscript to demonstrate, even for a single block, that the derived equivariant update cannot be recovered from AdamW by a simple global rescaling or reparameterization of the learning rate.
minor comments (2)
  1. [Abstract and experimental summary] The claim of 'consistent loss improvements' and 'several cases' of improved stability would be more informative if accompanied by quantitative effect sizes, number of independent runs, and standard deviations.
  2. [Notation and definitions] The precise definitions of 'one-sided spectral', 'hybrid row-norm/spectral', and 'left-spectral' updates should be stated with explicit matrix formulas in the main text rather than left to supplementary material.
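
The norm-matching control requested in major comment 1 could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' protocol: `match_rms` is a hypothetical helper, the baseline is the sign of the gradient (the direction a bias-corrected Adam step takes at its first iteration when its epsilon is negligible), and the candidate is a row-normalized gradient.

```python
import numpy as np

def rms(update):
    """Root-mean-square entry of an update matrix."""
    return float(np.sqrt(np.mean(update ** 2)))

def match_rms(candidate, reference, eps=1e-12):
    """Rescale `candidate` so that its RMS equals that of `reference`."""
    return candidate * (rms(reference) / (rms(candidate) + eps))

rng = np.random.default_rng(0)
grad = rng.standard_normal((8, 5))   # stand-in gradient for one weight block

# Stand-in baseline: coordinate-wise sign direction.
baseline = np.sign(grad)

# Stand-in candidate: each row divided by its own Euclidean norm.
candidate = grad / np.linalg.norm(grad, axis=1, keepdims=True)

# After matching, both updates have the same average step length,
# while the candidate keeps its own direction.
matched = match_rms(candidate, baseline)
```

With magnitudes equalized in this way, any remaining difference in loss can be attributed to the direction of the update rather than to its size.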

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We agree that isolating the role of equivariance from possible rescaling effects is essential for the central claim, and we outline revisions below to address both major comments.

Point-by-point responses
  1. Referee: [Experimental section] The manuscript reports consistent validation-loss improvements for the symmetry-compatible updates (row-norm, one-sided spectral, hybrid, centered row-norm, left-spectral) over AdamW across Qwen3-0.6B, Gemma-3-1B, OLMoE and gpt-oss runs, yet provides no per-method learning-rate sweeps or explicit matching of mean update norms / effective step lengths. Because the proposed rules embed explicit row/column normalizations and spectral projections absent from coordinate-wise AdamW, the observed gains could be produced by an implicit change in gradient magnitude rather than by permutation or shared-shift equivariance. This control is load-bearing for the central claim.

    Authors: We acknowledge that the current experiments lack per-method learning-rate sweeps and explicit matching of mean update norms. The symmetry-compatible rules do introduce row/column normalizations and spectral projections that alter effective magnitudes in a structured manner, so the referee's concern is valid. In the revised manuscript we will add experiments that tune learning rates separately for each method and explicitly match mean update norms (or effective step lengths) to AdamW baselines. This will allow direct comparison under comparable gradient magnitudes and thereby strengthen evidence that gains arise from permutation and shared-shift equivariance.

    Revision: yes

  2. Referee: [§3 (Derivations of symmetry-compatible updates)] The principle is introduced as an independent design axiom whose justification rests on external validation by the loss curves. It would strengthen the manuscript to demonstrate, even for a single block, that the derived equivariant update cannot be recovered from AdamW by a simple global rescaling or reparameterization of the learning rate.

    Authors: We agree that an explicit demonstration would reinforce the independence of the derived updates. For the embedding-matrix case we will add a short algebraic comparison in §3 (or an appendix) showing that the row-norm update scales each row by the inverse of its own gradient norm. This per-row, data-dependent scaling cannot be reproduced by any fixed global multiplier applied to an AdamW update, because the latter remains strictly coordinate-wise and lacks the row-wise aggregation required by permutation equivariance. A brief numerical counter-example on a small matrix will be included to illustrate that no single rescaling factor equates the two rules.

    Revision: yes
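
A small numerical counter-example of the kind described in response 2 might look like the sketch below. It rests on two assumptions that should be checked against the paper: the row-norm rule is taken to divide each gradient row by its Euclidean norm, as the response describes, and AdamW is represented by its bias-corrected first step without weight decay, which reduces to g / (|g| + eps). The function names are hypothetical.

```python
import numpy as np

def row_norm_update(grad, eps=1e-12):
    """Divide each row by its own Euclidean norm (assumed form of the rule)."""
    return grad / (np.linalg.norm(grad, axis=1, keepdims=True) + eps)

def adam_first_step(grad, eps=1e-8):
    """Bias-corrected Adam direction at step 1, no weight decay: g / (|g| + eps)."""
    return grad / (np.abs(grad) + eps)

grad = np.array([[3.0, 4.0],
                 [0.3, 0.4]])

u_row = row_norm_update(grad)    # both rows become approximately (0.6, 0.8)
u_adam = adam_first_step(grad)   # every entry is approximately 1

# A single global rescaling would require u_row / u_adam to be constant.
ratios = u_row / u_adam
single_scalar_exists = np.allclose(ratios, ratios.flat[0])

# The row-norm rule commutes with row permutations.
perm = np.array([1, 0])
row_perm_equivariant = np.allclose(
    row_norm_update(grad[perm]), row_norm_update(grad)[perm]
)
```

Here the entrywise ratios are roughly 0.6 and 0.8 within the same row, so no single scalar maps one update onto the other for this matrix.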

Circularity Check

0 steps flagged

Symmetry principle stated as independent axiom; derivation self-contained with external validation

Full rationale

The paper introduces the symmetry-compatible principle as a design axiom (gradient update equivariant under the symmetry group of each weight block) and derives specific update rules for embeddings, LM heads, SwiGLU MLPs, and MoE routers from that axiom plus standard equivariance requirements. No step reduces a claimed prediction or result to a fitted parameter or self-citation by construction. Experiments on Qwen3, Gemma, OLMoE, and gpt-oss models serve as independent validation rather than the source of the updates. The central claim remains independent of the reported loss curves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the single design axiom that updates must be equivariant under the symmetry group of each weight block, plus the empirical observation that the resulting rules outperform AdamW; no free parameters or new postulated entities are introduced.

axioms (1)
  • Domain assumption: The gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block.
    This is the load-bearing design principle stated in the abstract and used to derive all subsequent update rules.

pith-pipeline@v0.9.0 · 5853 in / 1248 out tokens · 32650 ms · 2026-05-20T09:30:08.976057+00:00 · methodology

