pith. sign in

arxiv: 2605.18106 · v4 · pith:AGW62PWNnew · submitted 2026-05-18 · 🧮 math.OC · cs.AI· cs.LG· stat.ML

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Pith reviewed 2026-06-30 18:39 UTC · model grok-4.3

classification 🧮 math.OC cs.AIcs.LGstat.ML
keywords symmetry-compatible optimizerequivariant gradient updatesembedding matricesSwiGLU MLPMoE routerlanguage model pretrainingAdamW comparison
0
0 comments X

The pith

The gradient update rule for each weight block should be equivariant under its symmetry group.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that coordinate-wise optimizers like Adam fail to respect natural symmetries in neural network parameters, such as permutation symmetry in embedding and LM head matrices or shared-shift symmetry in SwiGLU projections. It derives new update families that enforce matching equivariance, including one-sided spectral and row-norm rules for embeddings, hybrid row-norm and spectral rules for SwiGLU, and row-aware or column-aware rules for MoE routers. Pre-training experiments on multiple dense and sparse language model architectures show these symmetry-compatible updates reduce final validation loss, lessen expert load imbalance, and improve training stability relative to AdamW.

Core claim

Following the symmetry-compatible principle yields an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group, and these updates consistently improve final validation loss, reduce expert load imbalance, and improve training stability over AdamW.

What carries the argument

Symmetry-compatible principle requiring the gradient update rule to be equivariant under the symmetry group acting on the corresponding weight block.

If this is right

  • Embeddings and LM heads receive one-sided spectral or row-norm updates.
  • SwiGLU projections receive centered row-norm or hybrid spectral updates.
  • MoE router matrices receive row-aware, column-aware, or left-spectral updates.
  • Sparse MoE models exhibit reduced expert load imbalance.
  • Several runs show controlled vocabulary-logit growth and improved router stability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same principle could be applied to attention weight matrices by identifying their relevant symmetry groups.
  • Defaulting to coordinate-wise methods may be suboptimal once parameter symmetries are explicitly catalogued for a given architecture.
  • The approach invites systematic classification of symmetry groups across common layer types before optimizer selection.

Load-bearing premise

That the identified symmetry groups for embeddings, SwiGLU, and MoE routers are the ones whose equivariance actually drives the observed gains rather than incidental details of the implementations.

What would settle it

An ablation in the same pre-training setups that replaces the symmetry-equivariant updates with otherwise identical rules that deliberately violate the relevant equivariance and shows no loss in performance or stability.

Figures

Figures reproduced from arXiv: 2605.18106 by Tim Tsz-Kit Lau, Weijie Su.

Figure 1
Figure 1. Figure 1: Two perspectives on deep learning optimization. Left: coordinate-wise adaptive methods [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Validation losses for downsized gpt-oss pre-training. The configurations differ in the opti [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Training and validation losses for Qwen3-0.6B-style pre-training. In each subfigure, the [PITH_FULL_IMAGE:figures/full_fig_p022_3.png] view at source ↗
Figure 3
Figure 3. Figure 3: Training and validation losses for Qwen3-0.6B-style pre-training. In each subfigure, the [PITH_FULL_IMAGE:figures/full_fig_p025_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Training and validation losses for Gemma 3 1B-style pre-training. In each subfigure, the [PITH_FULL_IMAGE:figures/full_fig_p023_4.png] view at source ↗
Figure 4
Figure 4. Figure 4: Training and validation losses for Gemma 3 1B-style pre-training. In each subfigure, the [PITH_FULL_IMAGE:figures/full_fig_p026_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Training and validation losses for OLMoE-1B-7B-style pre-training. The configurations differ in the optimizers assigned to the embedding, LM head, and router matrices. The final validation losses for configurations (i)–(iv) are 4.0814, 4.0717, 4.1083, and 4.1155 respectively. As shown in [PITH_FULL_IMAGE:figures/full_fig_p025_5.png] view at source ↗
Figure 5
Figure 5. Figure 5: Training and validation losses for OLMoE-1B-7B-style pre-training. The configurations differ in the optimizers assigned to the embedding, LM head, and router matrices. The final validation losses for configurations (i)–(iv) are 4.0955, 4.0815, 4.1187, and 4.1240 respectively. As shown in [PITH_FULL_IMAGE:figures/full_fig_p028_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Training and validation losses for downsized gpt-oss pre-training. The configurations [PITH_FULL_IMAGE:figures/full_fig_p026_6.png] view at source ↗
Figure 6
Figure 6. Figure 6: Training and validation losses for downsized gpt-oss pre-training. The configurations [PITH_FULL_IMAGE:figures/full_fig_p029_6.png] view at source ↗
read the original abstract

A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible update rules consistently improve final validation loss, reduce expert load imbalance in sparse MoE models, and in several cases control final vocabulary-logit growth, improve router stability, and overall training stability over the corresponding AdamW updates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a symmetry-compatible principle for optimizer design: gradient updates for each parameter block should be equivariant under the symmetry group acting on that block. It unifies bi-orthogonally equivariant methods for general matrices and derives new families (one-sided spectral, row-norm, hybrid row-norm/spectral, centered row-norm, left-spectral) for blocks with permutation symmetry (embeddings, LM heads, MoE routers) or shared-shift symmetry (SwiGLU projections). These are assembled into an end-to-end layerwise optimizer stack and tested via pre-training on Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss models, with claims of improved validation loss, reduced expert load imbalance, better router stability, and controlled vocabulary-logit growth relative to AdamW.

Significance. If the reported gains can be isolated to the equivariance property rather than auxiliary normalization or scaling, the principle supplies a systematic, architecture-aware route to optimizer construction that goes beyond coordinate-wise methods. The unified treatment of bi-orthogonal updates and the explicit constructions for non-orthogonal symmetries are constructive contributions that could influence layer-specific optimizer design in large-model training.

major comments (2)
  1. [Experiments] Experiments section: the central claim that symmetry-compatible updates produce the observed gains in validation loss, load balance, and stability rests on comparisons solely against AdamW. No ablations or controls are described that hold normalization, centering, or spectral scaling fixed while breaking the stated equivariance (permutation or shared-shift), leaving open whether the functional forms' auxiliary properties, rather than equivariance, drive the results.
  2. [§3] §3 (derivations of row-norm, centered row-norm, and left-spectral updates): the update families are constructed to enforce the target equivariance, yet the same algebraic steps simultaneously introduce row-wise centering and norm-based scaling. Without a side-by-side comparison to a non-equivariant update that retains these auxiliary operations, the attribution of performance differences specifically to symmetry matching cannot be verified.
minor comments (1)
  1. [Abstract] Abstract: statements of 'consistent gains' and 'several cases' of improved stability are not accompanied by effect sizes, run counts, or statistical measures, making it difficult to assess practical significance from the summary alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of the symmetry-compatible principle. We address the two major comments point by point below.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim that symmetry-compatible updates produce the observed gains in validation loss, load balance, and stability rests on comparisons solely against AdamW. No ablations or controls are described that hold normalization, centering, or spectral scaling fixed while breaking the stated equivariance (permutation or shared-shift), leaving open whether the functional forms' auxiliary properties, rather than equivariance, drive the results.

    Authors: We agree that dedicated ablations isolating equivariance from auxiliary operations (centering, row-norm scaling) would strengthen attribution. However, these auxiliaries are not independent add-ons; they arise necessarily from the algebraic conditions that enforce equivariance under the permutation or shared-shift groups, as derived in §3. A control that retains exactly those auxiliaries while breaking equivariance would require an ad-hoc construction outside the symmetry principle. In revision we will expand the Experiments section with an explicit discussion of this inseparability and note the limitation of the current controls. revision: yes

  2. Referee: [§3] §3 (derivations of row-norm, centered row-norm, and left-spectral updates): the update families are constructed to enforce the target equivariance, yet the same algebraic steps simultaneously introduce row-wise centering and norm-based scaling. Without a side-by-side comparison to a non-equivariant update that retains these auxiliary operations, the attribution of performance differences specifically to symmetry matching cannot be verified.

    Authors: The derivations begin from the equivariance requirement; centering and norm scaling are the direct algebraic consequences of that requirement rather than separate design choices. A non-equivariant update preserving the identical auxiliaries would not satisfy the group action and therefore falls outside the scope of the proposed principle. We will revise §3 to emphasize this derivation link and to acknowledge that empirical isolation via controls remains an open experimental question for future work. revision: yes

Circularity Check

0 steps flagged

No circularity: principle stated independently, updates derived from it, gains measured empirically on held-out metrics

full rationale

The paper introduces the symmetry-compatible principle as an independent design rule (gradient update equivariant under the symmetry group of each weight block). It then derives specific update families (one-sided spectral, row-norm, etc.) that satisfy the stated equivariance by construction for embedding/LM-head (permutation), SwiGLU (shared-shift), and MoE router (permutation) symmetries. These derivations are presented as following from the principle and prior bi-orthogonal work, not as predictions fitted to the target performance numbers. Final claims of improved validation loss, load balance, and stability are supported by direct comparisons to AdamW on held-out pre-training runs across multiple architectures; no step equates a claimed result to its own defining inputs or reduces to a self-citation chain. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the new principle itself plus the assumption that the listed symmetry groups are the relevant ones for each block; no new physical entities or fitted constants are introduced.

axioms (1)
  • domain assumption The gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block.
    This is the load-bearing design principle stated in the abstract and used to derive all subsequent update rules.

pith-pipeline@v0.9.1-grok · 5877 in / 1318 out tokens · 18117 ms · 2026-06-30T18:39:21.996296+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dead-Direction Conditioners: Gauge-Equivariant Preconditioning for Deep Networks

    cs.LG 2026-06 unverdicted novelty 7.0

    Dead-Direction Conditioners provide gauge-equivariant preconditioning by conditioning optimizer state on symmetry orbits, yielding improved resistance to over-training collapse and higher detection of dead directions ...

Reference graph

Works this paper leans on

180 extracted references · 79 canonical work pages · cited by 1 Pith paper · 29 internal anchors

  1. [1]

    Abbe and E

    E. Abbe and E. Boix-Adsera. On the non-universality of deep learning: quantifying the cost of symmetry. InAdvances in Neural Information Processing Systems (NeurIPS). 2022

  2. [2]

    K. Ahn, N. Amsel, and J. Langford. Dion2: A simple method to shrink matrix in Muon.arXiv preprint 2512.16928, 2025

  3. [3]

    K. Ahn, B. Xu, N. Abreu, Y. Fan, G. Magakyan, P. Sharma, Z. Zhan, and J. Langford. Dion: Distributed orthonormalized updates.arXiv preprint arXiv:2504.05295, 2025

  4. [4]

    Ainslie, J

    J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. GQA: Training gen- eralized multi-query transformer models from multi-head checkpoints. InProceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 2023

  5. [5]

    Amsel, D

    N. Amsel, D. Persson, C. Musco, and R. M. Gower. The Polar Express: Optimal matrix sign methods and their application to the Muon algorithm. InInternational Conference on Learning Representations (ICLR). 2026

  6. [6]

    K. An, Y. Liu, R. Pan, S. Ma, D. Goldfarb, and T. Zhang. ASGO: Adaptive structured gradient optimization. InAdvances in Neural Information Processing Systems (NeurIPS). 2025

  7. [7]

    R. Anil, V. Gupta, T. Koren, K. Regan, and Y. Singer. Scalable second order optimization for deep learning.arXiv preprint arXiv:2002.09018, 2020

  8. [8]

    Anthony, Y

    Q. Anthony, Y. Tokpanov, S. Szot, S. Rajagopal, P. Medepalli, R. Iyer, V. Shyam, A. Golubeva, A. Chaurasia, et al. Training foundation models on a full-stack AMD platform: Compute, networking, and system design.arXiv preprint arXiv:2511.17127, 2025

  9. [9]

    L. Autonne. Sur les groupes linéaires, réels et orthogonaux.Bulletin de la Société Mathématique de France, 30:121–134, 1902

  10. [10]

    E. Bao, J. Lu, L. Song, N. Hart-Hodgson, W. Parson, and Y. Zhou. Equivariant neural networks and equivarification.arXiv preprint arXiv:1906.07172, 2019

  11. [11]

    Bernstein

    J. Bernstein. Deriving Muon. 2025

  12. [12]

    Bernstein

    J. Bernstein. Modular manifolds.Thinking Machines Lab: Connectionism, 2025. https:// thinkingmachines.ai/blog/modular-manifolds/

  13. [13]

    Bernstein and L

    J. Bernstein and L. Newhouse. Old optimizer, new norm: An anthology. InOPT 2024: Optimization for Machine Learning. 2024

  14. [14]

    Bernstein and L

    J. Bernstein and L. Newhouse. Modular duality in deep learning. InProceedings of the International Conference on Machine Learning (ICML). 2025

  15. [15]

    Bernstein, Y.-X

    J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD: Compressed optimisation for non-convex problems. InProceedings of the International Conference on Machine Learning (ICML). 2018

  16. [16]

    Bhatia.Matrix Analysis, volume 169

    R. Bhatia.Matrix Analysis, volume 169. Springer Science & Business Media, 2013

  17. [17]

    Boissin, T

    T. Boissin, T. Massena, F. Mamalet, and M. Serrurier. Turbo-Muon: Accelerating orthogonality-based optimization with pre-conditioning.arXiv preprint arXiv:2512.04632, 2025

  18. [18]

    An Introduction to Vision-Language Modeling,

    F. Bordes, R. Y. Pang, A. Ajay, A. C. Li, A. Bardes, S. Petryk, O. Mañas, Z. Lin, A. Mahmoud, et al. An introduction to vision-language modeling.arXiv preprint arXiv:2405.17247, 2024. 33

  19. [19]

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems (NeurIPS). 2020

  20. [20]

    Buchanan

    S. Buchanan. A faster manifold Muon with ADMM.https://sdbuchanan.com/blog/manifold-muon/, 2025

  21. [21]

    Carlson, V

    D. Carlson, V. Cevher, and L. Carin. Stochastic spectral descent for restricted Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS). 2015

  22. [22]

    Carlson, E

    D. Carlson, E. Collins, Y.-P. Hsieh, L. Carin, and V. Cevher. Preconditioned spectral descent for deep learning. InAdvances in Neural Information Processing Systems (NeurIPS). 2015

  23. [23]

    Carlson, Y.-P

    D. Carlson, Y.-P. Hsieh, E. Collins, L. Carin, and V. Cevher. Stochastic spectral descent for discrete graphical models.IEEE Journal of Selected Topics in Signal Processing, 10(2):296–311, 2016

  24. [24]

    On the Convergence of Muon and Beyond

    D. Chang, Y. Liu, and G. Yuan. On the convergence of Muon and beyond.arXiv preprint arXiv:2509.15816, 2025

  25. [25]

    MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

    D. Chang, Q. Shi, L. Zhang, Y. Li, R. Zhang, Y. Lu, Y. Liu, and G. Yuan. MuonEq: Balancing before orthogonalization with lightweight equilibration.arXiv preprint arXiv:2603.28254, 2026

  26. [26]

    L. Chen, J. Li, and Q. Liu. Muon optimizes under spectral norm constraints.arXiv preprint arXiv:2506.15054, 2025

  27. [27]

    Y. Chen, Y. Chi, J. Fan, and C. Ma. Spectral methods for data science: A statistical perspective. Foundations and Trends®in Machine Learning, 14(5):566–806, 2021

  28. [28]

    Chowdhery, S

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, et al. PaLM: Scaling language modeling with pathways.Journal of Machine Learning Research, 24(240):1–113, 2023

  29. [29]

    Crawshaw, C

    M. Crawshaw, C. Modi, M. Liu, and R. M. Gower. An exploration of non-Euclidean gradient descent: Muon and its many variants.arXiv preprint arXiv:2510.09827, 2025

  30. [30]

    G. E. Dahl, F. Schneider, Z. Nado, N. Agarwal, C. S. Sastry, P. Hennig, S. Medapati, R. Eschenhagen, P. Kasimbeg, et al. Benchmarking neural network training algorithms.arXiv preprint arXiv:2306.07179, 2023

  31. [31]

    Dao and A

    T. Dao and A. Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. InProceedings of the International Conference on Machine Learning (ICML). 2024

  32. [32]

    Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier. Language modeling with gated convolutional networks. InProceedings of the International Conference on Machine Learning (ICML). 2017

  33. [33]

    Davis and D

    D. Davis and D. Drusvyatskiy. When do spectral gradient updates help in deep learning?arXiv preprint arXiv:2512.04299, 2025

  34. [34]

    DeepSeek-V4: Towards highly efficient million-token context intelligence

    DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence. 2026

  35. [35]

    DeepSeek-AI, A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, et al. DeepSeek-V3 technical report.arXiv preprint arXiv:2412.19437, 2024

  36. [36]

    Dehghani, J

    M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. Steiner, M. Caron, R. Geirhos, et al. Scaling vision transformers to 22 billion parameters. InProceedings of the International Conference on Machine Learning (ICML). 2023. 34

  37. [37]

    S. Deng, Z. Ouyang, T. Pang, Z. Liu, R. Jin, S. Yu, and Y. Yang. RMNP: Row-momentum normalized preconditioning for scalable matrix-based optimization.arXiv preprint arXiv:2603.20527, 2026

  38. [38]

    Dewulf, D

    A. Dewulf, D. Pai, L. Yang, A. Zhang, and B. Keigwin. Aurora: A leverage-aware optimizer for rectangular matrices. 2026

  39. [39]

    C. Ding, D. Sun, J. Sun, and K.-C. Toh. Spectral operators of matrices.Mathematical Programming, 168(1):509–531, 2018

  40. [40]

    C. Ding, D. Sun, J. Sun, and K.-C. Toh. Spectral operators of matrices: Semismoothness and characterizations of the generalized Jacobian.SIAM Journal on Optimization, 30(1):630–659, 2020

  41. [41]

    Dosovitskiy, L

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR). 2021

  42. [42]

    Z. Du, H. He, and W. Su. Uncovering symmetry transfer in large language models via layer-peeled optimization.arXiv preprint arXiv:2605.12756, 2026

  43. [43]

    Du and W

    Z. Du and W. Su. The Newton–Muon optimizer.arXiv preprint arXiv:2604.01472, 2026

  44. [44]

    Duchi, E

    J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research, 12:2121–2159, 2011

  45. [45]

    Eschenhagen, A

    R. Eschenhagen, A. Cai, T.-H. Lee, and H.-J. M. Shi. Clarifying Shampoo: Adapting spectral descent to stochasticity and the parameter trajectory.arXiv preprint arXiv:2602.09314, 2026

  46. [46]

    Eschenhagen, A

    R. Eschenhagen, A. Immer, R. Turner, F. Schneider, and P. Hennig. Kronecker-factored approximate curvature for modern neural network architectures. InAdvances in Neural Information Processing Systems (NeurIPS). 2023

  47. [47]

    Layer sharding for large-scale training with Muon.https://www.essential.ai/research/ infra, 2025

    Essential AI. Layer sharding for large-scale training with Muon.https://www.essential.ai/research/ infra, 2025

  48. [48]

    Essential AI, I. Shah, A. M. Polloreno, K. Stratos, P. Monk, A. Chaluvaraju, A. Hojel, A. Ma, A. Thomas, et al. Practical efficiency of Muon for pretraining.arXiv preprint arXiv:2505.02222, 2025

  49. [49]

    Filatov, J

    O. Filatov, J. Wang, J. Ebert, and S. Kesselheim. Optimal scaling needs optimal norm.arXiv preprint arXiv:2510.03871, 2025

  50. [50]

    Frans, S

    K. Frans, S. Levine, and P. Abbeel. A stable whitening optimizer for efficient neural network training. InAdvances in Neural Information Processing Systems (NeurIPS). 2025

  51. [51]

    Gemma 3 Technical Report

    Gemma Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

  52. [52]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, M. Riviere, S. Pathak, P. G. Sessa, C. Hardin, S. Bhupatiraju, L. Hussenot, T. Mesnard, B. Shahriari, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

  53. [53]

    Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

    A. Glentis, J. Li, A. Han, and M. Hong. A minimalist optimizer design for LLM pretraining.arXiv preprint arXiv:2506.16659, 2025

  54. [54]

    GLM-4.5 Team, A. Zeng, X. Lv, Q. Zheng, Z. Hou, B. Chen, C. Xie, C. Wang, D. Yin, et al. GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models.arXiv preprint arXiv:2508.06471, 2025

  55. [55]

    GLM-5 Team, A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, et al. GLM-5: from vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763, 2026. 35

  56. [56]

    Goldfarb, Y

    D. Goldfarb, Y. Ren, and A. Bahamou. Practical quasi-Newton methods for training deep neural networks. InAdvances in Neural Information Processing Systems (NeurIPS). 2020

  57. [57]

    S. Gong, S. Agarwal, Y. Zhang, J. Ye, L. Zheng, M. Li, C. An, P. Zhao, W. Bi, et al. Scaling diffusion language models via adaptation from autoregressive models. InInternational Conference on Learning Representations (ICLR). 2025

  58. [58]

    W. Gong, J. Zazo, Q. Luo, P. Wang, J. Hensman, and C. Ma. ARO: A new lens on matrix optimization for large models.arXiv preprint arXiv:2602.09006, 2026

  59. [59]

    Gonon, A.-A

    A. Gonon, A.-A. Muşat, and N. Boumal. Insights on Muon from simple quadratics.arXiv preprint arXiv:2602.11948, 2026

  60. [60]

    Gemma 4 model card

    Google DeepMind. Gemma 4 model card. 2026

  61. [61]

    Grishina, M

    E. Grishina, M. Smirnov, and M. Rakhuba. Accelerating Newton-Schulz iteration for orthogonalization via Chebyshev-type polynomials.arXiv preprint arXiv:2506.10935, 2025

  62. [62]

    Gu and T

    A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. InProceedings of the Conference on Language Modeling (COLM). 2024

  63. [63]

    Gu and Z

    Y. Gu and Z. Xie. Mano: Restriking manifold optimization for LLM training.arXiv preprint arXiv:2601.23000, 2026

  64. [64]

    Gupta, T

    V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. In Proceedings of the International Conference on Machine Learning (ICML). 2018

  65. [65]

    K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2016

  66. [66]

    N. J. Higham. Computing the polar decomposition—with applications.SIAM Journal on Scientific and Statistical Computing, 7(4):1160–1174, 1986

  67. [67]

    N. J. Higham. Stable iterations for the matrix square root.Numerical Algorithms, 15(2):227–242, 1997

  68. [68]

    N. J. Higham.Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, 2008

  69. [69]

    Hoffmann, S

    J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de las Casas, L. A. Hendricks, J. Welbl, et al. Training compute-optimal large language models. InAdvances in Neural Information Processing Systems (NeurIPS). 2022

  70. [70]

    R. A. Horn and C. R. Johnson.Topics in Matrix Analysis. Cambridge University Press, 1994

  71. [71]

    R. A. Horn and C. R. Johnson.Matrix Analysis. Cambridge University Press, 2nd edition, 2012

  72. [72]

    Y. Hu, H. Song, J. Deng, J. Wang, J. Chen, K. Zhou, Y. Zhu, J. Jiang, Z. Dong, et al. YuLan-Mini: Pushing the limits of open data-efficient language model. InProceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (Volume 1: Long Papers). 2025

  73. [73]

    LiMuon: Light and Fast Muon Optimizer for Large Models

    F. Huang, Y. Luo, and S. Chen. LiMuon: Light and fast Muon optimizer for large models.arXiv preprint arXiv:2509.14562, 2025

  74. [74]

    A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

  75. [75]

    Jiang, Z

    R. Jiang, Z. Mhammedi, M. Mohri, and A. Mokhtari. Adaptive matrix online learning through smoothing with guarantees for nonsmooth nonconvex optimization.arXiv preprint arXiv:2602.08232, 2026. 36

  76. [76]

    Jiang, J

    Z. Jiang, J. Gu, H. Zhu, and D. Pan. Pre-RMSNorm and Pre-CRMSNorm transformers: equivalent and efficient Pre-LN transformers. InAdvances in Neural Information Processing Systems (NeurIPS). 2023

  77. [77]

    Jordan, J

    K. Jordan, J. Bernstein, B. Rappazzo, @fernbear.bsky.social, B. Vlado, Y. Jiacheng, F. Cesista, B. Koszarsky, and @Grad62304977.modded-nanogpt: Speedrunning the NanoGPT baseline. 2024

  78. [78]

    Jordan, Y

    K. Jordan, Y. Jin, V. Boza, Y. Jiacheng, F. Cecista, L. Newhouse, and J. Bernstein. Muon: An optimizer for hidden layers in neural networks. 2024

  79. [79]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, et al. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  80. [80]

    Kasimbeg, F

    P. Kasimbeg, F. Schneider, R. Eschenhagen, J. Bae, C. S. Sastry, M. Saroufim, B. Feng, L. Wright, E. Z. Yang, et al. Accelerating neural network training: An analysis of the AlgoPerf competition. In International Conference on Learning Representations (ICLR). 2025

Showing first 80 references.