Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Tim Tsz-Kit Lau; Weijie Su

arXiv: 2605.18106 · v4 · pith:AGW62PWNnew · submitted 2026-05-18 · 🧮 math.OC · cs.AI · cs.LG · stat.ML

Pith reviewed 2026-06-30 18:39 UTC · model grok-4.3

keywords: symmetry-compatible optimizer · equivariant gradient updates · embedding matrices · SwiGLU MLP · MoE router · language model pretraining · AdamW comparison

The pith

The gradient update rule for each weight block should be equivariant under its symmetry group.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that coordinate-wise optimizers like Adam fail to respect natural symmetries in neural network parameters, such as permutation symmetry in embedding and LM head matrices or shared-shift symmetry in SwiGLU projections. It derives new update families that enforce matching equivariance, including one-sided spectral and row-norm rules for embeddings, hybrid row-norm and spectral rules for SwiGLU, and row-aware or column-aware rules for MoE routers. Pre-training experiments on multiple dense and sparse language model architectures show these symmetry-compatible updates reduce final validation loss, lessen expert load imbalance, and improve training stability relative to AdamW.

Core claim

Following the symmetry-compatible principle yields an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group, and these updates consistently improve final validation loss, reduce expert load imbalance, and improve training stability over AdamW.
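
A layerwise stack of this kind amounts to a dispatch from parameter class to update rule. The sketch below is illustrative only: the name patterns, the choice of which SwiGLU projections get the centered rule, and the plain momentum-free step are all assumptions made here, and the update functions are simple stand-ins rather than the paper's formulas. Router rules are left out because this page does not describe them in enough detail.

```python
import numpy as np

def polar(G):
    """Orthogonal polar factor U V^T of G (a Muon-style spectral direction)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def row_norm(G, eps=1e-12):
    """Scale each row to unit Euclidean norm."""
    return G / np.maximum(np.linalg.norm(G, axis=1, keepdims=True), eps)

def centered_row_norm(G, eps=1e-12):
    """Remove the mean row, then scale each row to unit norm."""
    return row_norm(G - G.mean(axis=0, keepdims=True), eps)

# Hypothetical name patterns -> update rule. Router rules are omitted on purpose.
RULES = [
    ("embed", row_norm),
    ("lm_head", row_norm),
    ("mlp.gate", centered_row_norm),
    ("mlp.up", centered_row_norm),
    ("attn", polar),
]

def pick_rule(name):
    """Return the first rule whose pattern occurs in the parameter name, else None."""
    for pattern, rule in RULES:
        if pattern in name:
            return rule
    return None

def step(params, grads, lr=0.01):
    """One plain step; parameters without a rule are returned unchanged."""
    out = {}
    for name, W in params.items():
        rule = pick_rule(name)
        out[name] = W if rule is None else W - lr * rule(grads[name])
    return out

rng = np.random.default_rng(3)
shapes = {"embed.weight": (10, 4), "layer0.mlp.gate.weight": (6, 4),
          "layer0.attn.q.weight": (4, 4), "layer0.router.weight": (3, 4)}
params = {k: rng.standard_normal(s) for k, s in shapes.items()}
grads = {k: rng.standard_normal(s) for k, s in shapes.items()}
new_params = step(params, grads)
```

In a real training loop the unmatched parameters would go to whatever baseline optimizer is in use; here they are simply passed through so the dispatch logic stays visible.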

What carries the argument

Symmetry-compatible principle requiring the gradient update rule to be equivariant under the symmetry group acting on the corresponding weight block.
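
In symbols, the requirement can be sketched as follows. The notation is assumed here and not taken from the paper: a group acts on a weight block and on its gradient G through a linear action ρ, and Φ maps a gradient to an update direction.

```latex
\Phi\bigl(\rho(g)\,G\bigr) \;=\; \rho(g)\,\Phi(G)
\qquad \text{for all } g \text{ in the symmetry group.}

% Bi-orthogonal case: \rho(Q_1, Q_2)\,G = Q_1 G Q_2^{\top} with Q_1, Q_2 orthogonal.
% If G = U \Sigma V^{\top} is a thin SVD of a full-rank G, the polar factor is
% \operatorname{polar}(G) = U V^{\top}, and
\operatorname{polar}\bigl(Q_1 G Q_2^{\top}\bigr)
  \;=\; (Q_1 U)(Q_2 V)^{\top}
  \;=\; Q_1 \operatorname{polar}(G)\, Q_2^{\top}.
```

The second identity is the standard reason polar-factor updates of the Muon type count as bi-orthogonally equivariant.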

If this is right

Embeddings and LM heads receive one-sided spectral or row-norm updates.
SwiGLU projections receive centered row-norm or hybrid spectral updates.
MoE router matrices receive row-aware, column-aware, or left-spectral updates.
Sparse MoE models exhibit reduced expert load imbalance.
Several runs show controlled vocabulary-logit growth and improved router stability.
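
As a rough illustration of what a row-norm update could look like, the sketch below scales each gradient row to unit Euclidean norm and checks numerically that the rule commutes with row permutations and with orthogonal rotations of the column axis. Reading "row-norm" as plain per-row normalization is an assumption; the paper's exact scaling is not given on this page, and the toy sizes are arbitrary.

```python
import numpy as np

def row_norm_update(G, eps=1e-12):
    """Scale every row of G to unit Euclidean norm (eps guards against all-zero rows)."""
    norms = np.linalg.norm(G, axis=1, keepdims=True)
    return G / np.maximum(norms, eps)

rng = np.random.default_rng(0)
vocab, dim = 8, 5                       # toy sizes
G = rng.standard_normal((vocab, dim))   # stand-in for an embedding gradient

perm = rng.permutation(vocab)           # relabel the rows (tokens)
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # rotate the hidden axis

# Permuting rows before or after the update gives the same result.
perm_gap = np.linalg.norm(row_norm_update(G[perm]) - row_norm_update(G)[perm])
# Right-multiplying by an orthogonal matrix preserves row norms, so it commutes too.
rot_gap = np.linalg.norm(row_norm_update(G @ Q) - row_norm_update(G) @ Q)
print(perm_gap, rot_gap)
```

Both gaps should come out at floating-point noise level, which is the numerical signature of equivariance under those two actions.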

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same principle could be applied to attention weight matrices by identifying their relevant symmetry groups.
Defaulting to coordinate-wise methods may be suboptimal once parameter symmetries are explicitly catalogued for a given architecture.
The approach invites systematic classification of symmetry groups across common layer types before optimizer selection.

Load-bearing premise

That the identified symmetry groups for embeddings, SwiGLU, and MoE routers are the ones whose equivariance actually drives the observed gains rather than incidental details of the implementations.

What would settle it

An ablation in the same pre-training setups that replaces the symmetry-equivariant updates with otherwise identical rules that deliberately violate the relevant equivariance and shows no loss in performance or stability.
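
One way such a control could be set up is sketched below. Both functions keep the row normalization, but the second multiplies by a fixed scale tied to the row index, which deliberately breaks permutation equivariance. The control rule and the gap metric are hypothetical constructions for illustration, not something the paper or this page proposes.

```python
import numpy as np

def row_norm(G):
    """Reference rule: unit-normalize each row (commutes with row permutations)."""
    return G / np.linalg.norm(G, axis=1, keepdims=True)

def row_norm_index_scaled(G):
    """Hypothetical control: same normalization, then a fixed scale tied to the row index.
    The scale stays put when rows are relabeled, so permutation equivariance fails."""
    scale = np.linspace(0.5, 1.5, G.shape[0])[:, None]
    return scale * row_norm(G)

def permutation_gap(update, G, perm):
    """Relative Frobenius gap between update(P G) and P update(G)."""
    diff = update(G[perm]) - update(G)[perm]
    return np.linalg.norm(diff) / np.linalg.norm(update(G))

rng = np.random.default_rng(1)
G = rng.standard_normal((16, 8))
perm = np.roll(np.arange(16), 1)        # cyclic shift: moves every row

gap_equivariant = permutation_gap(row_norm, G, perm)
gap_control = permutation_gap(row_norm_index_scaled, G, perm)
print(gap_equivariant, gap_control)
```

A pre-training comparison between two such rules would hold the normalization fixed and vary only the equivariance, which is the separation the ablation calls for.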

Figures

Figures reproduced from arXiv: 2605.18106 by Tim Tsz-Kit Lau, Weijie Su.

**Figure 2.** Validation losses for downsized gpt-oss pre-training. The configurations differ in the opti…

**Figure 3.** Training and validation losses for Qwen3-0.6B-style pre-training. In each subfigure, the…

**Figure 4.** Training and validation losses for Gemma 3 1B-style pre-training. In each subfigure, the…

**Figure 5.** Training and validation losses for OLMoE-1B-7B-style pre-training. The configurations differ in the optimizers assigned to the embedding, LM head, and router matrices. The final validation losses for configurations (i)–(iv) are 4.0814, 4.0717, 4.1083, and 4.1155, respectively. As shown in…

**Figure 6.** Training and validation losses for downsized gpt-oss pre-training. The configurations…

Original abstract

A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible update rules consistently improve final validation loss, reduce expert load imbalance in sparse MoE models, and in several cases control final vocabulary-logit growth, improve router stability, and overall training stability over the corresponding AdamW updates.
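
The contrast drawn in the abstract between coordinate-wise and bi-orthogonally equivariant updates can be checked numerically on a toy gradient. In the sketch below, the entrywise sign is used as a simplified stand-in for an Adam-style coordinate-wise direction (an assumption, since Adam also tracks moment estimates), and the polar factor from an SVD stands in for the Muon-style spectral direction.

```python
import numpy as np

def sign_update(G):
    """Coordinate-wise direction: sign of each entry (simplified Adam-style stand-in)."""
    return np.sign(G)

def polar_update(G):
    """Orthogonal polar factor U V^T from the thin SVD G = U diag(s) V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def random_orthogonal(n, rng):
    """Orthogonal matrix from the QR factorization of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

def equivariance_gap(update, G, Q1, Q2):
    """Frobenius norm of update(Q1 G Q2^T) - Q1 update(G) Q2^T."""
    return np.linalg.norm(update(Q1 @ G @ Q2.T) - Q1 @ update(G) @ Q2.T)

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))
Q1, Q2 = random_orthogonal(6, rng), random_orthogonal(4, rng)

gap_sign = equivariance_gap(sign_update, G, Q1, Q2)
gap_polar = equivariance_gap(polar_update, G, Q1, Q2)
print(f"sign gap  = {gap_sign:.3e}")
print(f"polar gap = {gap_polar:.3e}")
```

The polar gap should sit at rounding-error level while the sign gap is of order one, since rotating the gradient changes which entries are positive.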

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper derives explicit symmetry-equivariant updates for embeddings, SwiGLU, and MoE routers, but the experimental claims rest on unreported details.

The main point is that this paper works out concrete update rules for embedding and LM head matrices, SwiGLU projections, and MoE router matrices so that each respects its natural symmetry group (permutation or shared-shift). It then claims these rules beat AdamW on validation loss and stability in pretraining runs on a handful of model families.

What is new is the move from the usual bi-orthogonal equivariance (already in spectral descent and Muon) to permutation equivariance for embeddings and routers and shared-shift for SwiGLU. The listed families—one-sided spectral, row-norm, hybrid, centered row-norm, left-spectral—are spelled out for those blocks and assembled into an end-to-end layerwise stack. That construction is the actual contribution.

The derivations themselves look clean and self-contained. The paper does a reasonable job explaining why those particular symmetries fit the parameter blocks and how the updates are built to match them.

The soft spot is the evidence. The abstract states that the rules give consistent gains across Qwen3-, Gemma-, OLMoE-, and gpt-oss-style models, but supplies no numbers, tables, error bars, or ablation results. Without those, it is impossible to tell whether the reported improvements come from matching the stated symmetries or from incidental features such as row normalization and centering that the updates also introduce. The stress-test concern therefore lands: the experiments compare only against AdamW and do not test whether non-equivariant updates with similar auxiliary properties would produce the same effect.

This is for researchers who build or tune optimizers for transformers and MoE models and want layer-specific alternatives to coordinate-wise methods. A reader looking for implementable update formulas might find the constructions worth trying.

It deserves a serious referee because the mathematical constructions are explicit enough to check and the overall principle is stated without internal contradiction. I would send it to peer review rather than desk reject so the experimental controls and quantitative results can be examined.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a symmetry-compatible principle for optimizer design: gradient updates for each parameter block should be equivariant under the symmetry group acting on that block. It unifies bi-orthogonally equivariant methods for general matrices and derives new families (one-sided spectral, row-norm, hybrid row-norm/spectral, centered row-norm, left-spectral) for blocks with permutation symmetry (embeddings, LM heads, MoE routers) or shared-shift symmetry (SwiGLU projections). These are assembled into an end-to-end layerwise optimizer stack and tested via pre-training on Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss models, with claims of improved validation loss, reduced expert load imbalance, better router stability, and controlled vocabulary-logit growth relative to AdamW.

Significance. If the reported gains can be isolated to the equivariance property rather than auxiliary normalization or scaling, the principle supplies a systematic, architecture-aware route to optimizer construction that goes beyond coordinate-wise methods. The unified treatment of bi-orthogonal updates and the explicit constructions for non-orthogonal symmetries are constructive contributions that could influence layer-specific optimizer design in large-model training.

major comments (2)

[Experiments] Experiments section: the central claim that symmetry-compatible updates produce the observed gains in validation loss, load balance, and stability rests on comparisons solely against AdamW. No ablations or controls are described that hold normalization, centering, or spectral scaling fixed while breaking the stated equivariance (permutation or shared-shift), leaving open whether the functional forms' auxiliary properties, rather than equivariance, drive the results.
[§3] §3 (derivations of row-norm, centered row-norm, and left-spectral updates): the update families are constructed to enforce the target equivariance, yet the same algebraic steps simultaneously introduce row-wise centering and norm-based scaling. Without a side-by-side comparison to a non-equivariant update that retains these auxiliary operations, the attribution of performance differences specifically to symmetry matching cannot be verified.

minor comments (1)

[Abstract] Abstract: statements of 'consistent gains' and 'several cases' of improved stability are not accompanied by effect sizes, run counts, or statistical measures, making it difficult to assess practical significance from the summary alone.
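
The only numeric results quoted on this page are the four final validation losses in the Figure 5 caption. A quick calculation of their spread gives a sense of scale; the truncated caption does not say here which configuration uses which optimizer, so nothing below attributes the gap to a particular update rule, and no run counts or variances are available.

```python
# Final validation losses quoted in the Figure 5 caption, configurations (i)-(iv).
final_val_loss = {"i": 4.0814, "ii": 4.0717, "iii": 4.1083, "iv": 4.1155}

best = min(final_val_loss, key=final_val_loss.get)
worst = max(final_val_loss, key=final_val_loss.get)
spread = final_val_loss[worst] - final_val_loss[best]
relative_spread = spread / final_val_loss[worst]

print(f"best: ({best}), worst: ({worst})")
print(f"absolute spread: {spread:.4f}")
print(f"relative spread: {relative_spread:.2%}")
```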

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of the symmetry-compatible principle. We address the two major comments point by point below.

Referee: [Experiments] Experiments section: the central claim that symmetry-compatible updates produce the observed gains in validation loss, load balance, and stability rests on comparisons solely against AdamW. No ablations or controls are described that hold normalization, centering, or spectral scaling fixed while breaking the stated equivariance (permutation or shared-shift), leaving open whether the functional forms' auxiliary properties, rather than equivariance, drive the results.

Authors: We agree that dedicated ablations isolating equivariance from auxiliary operations (centering, row-norm scaling) would strengthen attribution. However, these auxiliaries are not independent add-ons; they arise necessarily from the algebraic conditions that enforce equivariance under the permutation or shared-shift groups, as derived in §3. A control that retains exactly those auxiliaries while breaking equivariance would require an ad-hoc construction outside the symmetry principle. In revision we will expand the Experiments section with an explicit discussion of this inseparability and note the limitation of the current controls.
Revision: yes

Referee: [§3] §3 (derivations of row-norm, centered row-norm, and left-spectral updates): the update families are constructed to enforce the target equivariance, yet the same algebraic steps simultaneously introduce row-wise centering and norm-based scaling. Without a side-by-side comparison to a non-equivariant update that retains these auxiliary operations, the attribution of performance differences specifically to symmetry matching cannot be verified.

Authors: The derivations begin from the equivariance requirement; centering and norm scaling are the direct algebraic consequences of that requirement rather than separate design choices. A non-equivariant update preserving the identical auxiliaries would not satisfy the group action and therefore falls outside the scope of the proposed principle. We will revise §3 to emphasize this derivation link and to acknowledge that empirical isolation via controls remains an open experimental question for future work.
Revision: yes
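
The link between centering and a shared shift can be illustrated numerically. The sketch below assumes the shift acts by adding the same vector to every row of the matrix; this page does not state the precise group action, so the axis choice and the "center, then row-normalize" reading of centered row-norm are both assumptions.

```python
import numpy as np

def center_rows(G):
    """Subtract the mean row from every row, i.e. apply (I - 11^T/n) on the left."""
    return G - G.mean(axis=0, keepdims=True)

def centered_row_norm(G, eps=1e-12):
    """Center first, then scale each row to unit norm (one plausible reading)."""
    C = center_rows(G)
    return C / np.maximum(np.linalg.norm(C, axis=1, keepdims=True), eps)

rng = np.random.default_rng(2)
G = rng.standard_normal((6, 4))
v = rng.standard_normal(4)
G_shifted = G + v                       # broadcasting adds the same vector v to every row

# The centered rule cannot see the shared shift.
shift_gap = np.linalg.norm(centered_row_norm(G_shifted) - centered_row_norm(G))
# After centering alone, every column sums to zero: no shared-shift component remains.
column_sums = center_rows(G).sum(axis=0)
print(shift_gap, np.abs(column_sums).max())
```

Under these assumptions the centering step is exactly the projection that removes the shift direction, which is consistent with the authors' point that it is not an independent add-on.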

Circularity Check

0 steps flagged

No circularity: principle stated independently, updates derived from it, gains measured empirically on held-out metrics

The paper introduces the symmetry-compatible principle as an independent design rule (gradient update equivariant under the symmetry group of each weight block). It then derives specific update families (one-sided spectral, row-norm, etc.) that satisfy the stated equivariance by construction for embedding/LM-head (permutation), SwiGLU (shared-shift), and MoE router (permutation) symmetries. These derivations are presented as following from the principle and prior bi-orthogonal work, not as predictions fitted to the target performance numbers. Final claims of improved validation loss, load balance, and stability are supported by direct comparisons to AdamW on held-out pre-training runs across multiple architectures; no step equates a claimed result to its own defining inputs or reduces to a self-citation chain. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the new principle itself plus the assumption that the listed symmetry groups are the relevant ones for each block; no new physical entities or fitted constants are introduced.

axioms (1)

Domain assumption: The gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block.
This is the load-bearing design principle stated in the abstract and used to derive all subsequent update rules.

pith-pipeline@v0.9.1-grok · 5877 in / 1318 out tokens · 18117 ms · 2026-06-30T18:39:21.996296+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dead-Direction Conditioners: Gauge-Equivariant Preconditioning for Deep Networks
cs.LG · 2026-06 · unverdicted · novelty 7.0

Dead-Direction Conditioners provide gauge-equivariant preconditioning by conditioning optimizer state on symmetry orbits, yielding improved resistance to over-training collapse and higher detection of dead directions ...

Reference graph

Works this paper leans on

180 extracted references · 79 canonical work pages · cited by 1 Pith paper · 29 internal anchors

[1]

Abbe and E

E. Abbe and E. Boix-Adsera. On the non-universality of deep learning: quantifying the cost of symmetry. InAdvances in Neural Information Processing Systems (NeurIPS). 2022

2022
[2]

K. Ahn, N. Amsel, and J. Langford. Dion2: A simple method to shrink matrix in Muon.arXiv preprint 2512.16928, 2025

work page arXiv 2025
[3]

K. Ahn, B. Xu, N. Abreu, Y. Fan, G. Magakyan, P. Sharma, Z. Zhan, and J. Langford. Dion: Distributed orthonormalized updates.arXiv preprint arXiv:2504.05295, 2025

work page arXiv 2025
[4]

Ainslie, J

J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. GQA: Training gen- eralized multi-query transformer models from multi-head checkpoints. InProceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 2023

2023
[5]

Amsel, D

N. Amsel, D. Persson, C. Musco, and R. M. Gower. The Polar Express: Optimal matrix sign methods and their application to the Muon algorithm. InInternational Conference on Learning Representations (ICLR). 2026

2026
[6]

K. An, Y. Liu, R. Pan, S. Ma, D. Goldfarb, and T. Zhang. ASGO: Adaptive structured gradient optimization. InAdvances in Neural Information Processing Systems (NeurIPS). 2025

2025
[7]

R. Anil, V. Gupta, T. Koren, K. Regan, and Y. Singer. Scalable second order optimization for deep learning.arXiv preprint arXiv:2002.09018, 2020

work page arXiv 2002
[8]

Anthony, Y

Q. Anthony, Y. Tokpanov, S. Szot, S. Rajagopal, P. Medepalli, R. Iyer, V. Shyam, A. Golubeva, A. Chaurasia, et al. Training foundation models on a full-stack AMD platform: Compute, networking, and system design.arXiv preprint arXiv:2511.17127, 2025

work page arXiv 2025
[9]

L. Autonne. Sur les groupes linéaires, réels et orthogonaux.Bulletin de la Société Mathématique de France, 30:121–134, 1902

1902
[10]

E. Bao, J. Lu, L. Song, N. Hart-Hodgson, W. Parson, and Y. Zhou. Equivariant neural networks and equivarification.arXiv preprint arXiv:1906.07172, 2019

work page arXiv 1906
[11]

Bernstein

J. Bernstein. Deriving Muon. 2025

2025
[12]

Bernstein

J. Bernstein. Modular manifolds.Thinking Machines Lab: Connectionism, 2025. https:// thinkingmachines.ai/blog/modular-manifolds/

2025
[13]

Bernstein and L

J. Bernstein and L. Newhouse. Old optimizer, new norm: An anthology. InOPT 2024: Optimization for Machine Learning. 2024

2024
[14]

Bernstein and L

J. Bernstein and L. Newhouse. Modular duality in deep learning. InProceedings of the International Conference on Machine Learning (ICML). 2025

2025
[15]

Bernstein, Y.-X

J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD: Compressed optimisation for non-convex problems. InProceedings of the International Conference on Machine Learning (ICML). 2018

2018
[16]

Bhatia.Matrix Analysis, volume 169

R. Bhatia.Matrix Analysis, volume 169. Springer Science & Business Media, 2013

2013
[17]

Boissin, T

T. Boissin, T. Massena, F. Mamalet, and M. Serrurier. Turbo-Muon: Accelerating orthogonality-based optimization with pre-conditioning.arXiv preprint arXiv:2512.04632, 2025

work page arXiv 2025
[18]

An Introduction to Vision-Language Modeling,

F. Bordes, R. Y. Pang, A. Ajay, A. C. Li, A. Bardes, S. Petryk, O. Mañas, Z. Lin, A. Mahmoud, et al. An introduction to vision-language modeling.arXiv preprint arXiv:2405.17247, 2024. 33

work page arXiv 2024
[19]

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems (NeurIPS). 2020

2020
[20]

Buchanan

S. Buchanan. A faster manifold Muon with ADMM.https://sdbuchanan.com/blog/manifold-muon/, 2025

2025
[21]

Carlson, V

D. Carlson, V. Cevher, and L. Carin. Stochastic spectral descent for restricted Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS). 2015

2015
[22]

Carlson, E

D. Carlson, E. Collins, Y.-P. Hsieh, L. Carin, and V. Cevher. Preconditioned spectral descent for deep learning. InAdvances in Neural Information Processing Systems (NeurIPS). 2015

2015
[23]

Carlson, Y.-P

D. Carlson, Y.-P. Hsieh, E. Collins, L. Carin, and V. Cevher. Stochastic spectral descent for discrete graphical models.IEEE Journal of Selected Topics in Signal Processing, 10(2):296–311, 2016

2016
[24]

On the Convergence of Muon and Beyond

D. Chang, Y. Liu, and G. Yuan. On the convergence of Muon and beyond.arXiv preprint arXiv:2509.15816, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

D. Chang, Q. Shi, L. Zhang, Y. Li, R. Zhang, Y. Lu, Y. Liu, and G. Yuan. MuonEq: Balancing before orthogonalization with lightweight equilibration.arXiv preprint arXiv:2603.28254, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[26]

L. Chen, J. Li, and Q. Liu. Muon optimizes under spectral norm constraints.arXiv preprint arXiv:2506.15054, 2025

work page arXiv 2025
[27]

Y. Chen, Y. Chi, J. Fan, and C. Ma. Spectral methods for data science: A statistical perspective. Foundations and Trends®in Machine Learning, 14(5):566–806, 2021

2021
[28]

Chowdhery, S

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, et al. PaLM: Scaling language modeling with pathways.Journal of Machine Learning Research, 24(240):1–113, 2023

2023
[29]

Crawshaw, C

M. Crawshaw, C. Modi, M. Liu, and R. M. Gower. An exploration of non-Euclidean gradient descent: Muon and its many variants.arXiv preprint arXiv:2510.09827, 2025

work page arXiv 2025
[30]

G. E. Dahl, F. Schneider, Z. Nado, N. Agarwal, C. S. Sastry, P. Hennig, S. Medapati, R. Eschenhagen, P. Kasimbeg, et al. Benchmarking neural network training algorithms.arXiv preprint arXiv:2306.07179, 2023

work page arXiv 2023
[31]

Dao and A

T. Dao and A. Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. InProceedings of the International Conference on Machine Learning (ICML). 2024

2024
[32]

Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier. Language modeling with gated convolutional networks. InProceedings of the International Conference on Machine Learning (ICML). 2017

2017
[33]

Davis and D

D. Davis and D. Drusvyatskiy. When do spectral gradient updates help in deep learning?arXiv preprint arXiv:2512.04299, 2025

work page arXiv 2025
[34]

DeepSeek-V4: Towards highly efficient million-token context intelligence

DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence. 2026

2026
[35]

DeepSeek-AI, A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, et al. DeepSeek-V3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Dehghani, J

M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. Steiner, M. Caron, R. Geirhos, et al. Scaling vision transformers to 22 billion parameters. InProceedings of the International Conference on Machine Learning (ICML). 2023. 34

2023
[37]

S. Deng, Z. Ouyang, T. Pang, Z. Liu, R. Jin, S. Yu, and Y. Yang. RMNP: Row-momentum normalized preconditioning for scalable matrix-based optimization.arXiv preprint arXiv:2603.20527, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[38]

Dewulf, D

A. Dewulf, D. Pai, L. Yang, A. Zhang, and B. Keigwin. Aurora: A leverage-aware optimizer for rectangular matrices. 2026

2026
[39]

C. Ding, D. Sun, J. Sun, and K.-C. Toh. Spectral operators of matrices.Mathematical Programming, 168(1):509–531, 2018

2018
[40]

C. Ding, D. Sun, J. Sun, and K.-C. Toh. Spectral operators of matrices: Semismoothness and characterizations of the generalized Jacobian.SIAM Journal on Optimization, 30(1):630–659, 2020

2020
[41]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR). 2021

2021
[42]

Z. Du, H. He, and W. Su. Uncovering symmetry transfer in large language models via layer-peeled optimization.arXiv preprint arXiv:2605.12756, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[43]

Du and W

Z. Du and W. Su. The Newton–Muon optimizer.arXiv preprint arXiv:2604.01472, 2026

work page arXiv 2026
[44]

Duchi, E

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research, 12:2121–2159, 2011

2011
[45]

Eschenhagen, A

R. Eschenhagen, A. Cai, T.-H. Lee, and H.-J. M. Shi. Clarifying Shampoo: Adapting spectral descent to stochasticity and the parameter trajectory.arXiv preprint arXiv:2602.09314, 2026

work page arXiv 2026
[46]

Eschenhagen, A

R. Eschenhagen, A. Immer, R. Turner, F. Schneider, and P. Hennig. Kronecker-factored approximate curvature for modern neural network architectures. InAdvances in Neural Information Processing Systems (NeurIPS). 2023

2023
[47]

Layer sharding for large-scale training with Muon.https://www.essential.ai/research/ infra, 2025

Essential AI. Layer sharding for large-scale training with Muon.https://www.essential.ai/research/ infra, 2025

2025
[48]

Essential AI, I. Shah, A. M. Polloreno, K. Stratos, P. Monk, A. Chaluvaraju, A. Hojel, A. Ma, A. Thomas, et al. Practical efficiency of Muon for pretraining.arXiv preprint arXiv:2505.02222, 2025

work page arXiv 2025
[49]

Filatov, J

O. Filatov, J. Wang, J. Ebert, and S. Kesselheim. Optimal scaling needs optimal norm.arXiv preprint arXiv:2510.03871, 2025

work page arXiv 2025
[50]

Frans, S

K. Frans, S. Levine, and P. Abbeel. A stable whitening optimizer for efficient neural network training. InAdvances in Neural Information Processing Systems (NeurIPS). 2025

2025
[51]

Gemma 3 Technical Report

Gemma Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team, M. Riviere, S. Pathak, P. G. Sessa, C. Hardin, S. Bhupatiraju, L. Hussenot, T. Mesnard, B. Shahriari, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

A. Glentis, J. Li, A. Han, and M. Hong. A minimalist optimizer design for LLM pretraining.arXiv preprint arXiv:2506.16659, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

GLM-4.5 Team, A. Zeng, X. Lv, Q. Zheng, Z. Hou, B. Chen, C. Xie, C. Wang, D. Yin, et al. GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models.arXiv preprint arXiv:2508.06471, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

GLM-5 Team, A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, et al. GLM-5: from vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763, 2026. 35

work page internal anchor Pith review Pith/arXiv arXiv 2026
[56]

Goldfarb, Y

D. Goldfarb, Y. Ren, and A. Bahamou. Practical quasi-Newton methods for training deep neural networks. InAdvances in Neural Information Processing Systems (NeurIPS). 2020

2020
[57]

S. Gong, S. Agarwal, Y. Zhang, J. Ye, L. Zheng, M. Li, C. An, P. Zhao, W. Bi, et al. Scaling diffusion language models via adaptation from autoregressive models. InInternational Conference on Learning Representations (ICLR). 2025

2025
[58]

W. Gong, J. Zazo, Q. Luo, P. Wang, J. Hensman, and C. Ma. ARO: A new lens on matrix optimization for large models.arXiv preprint arXiv:2602.09006, 2026

work page arXiv 2026
[59]

Gonon, A.-A

A. Gonon, A.-A. Muşat, and N. Boumal. Insights on Muon from simple quadratics.arXiv preprint arXiv:2602.11948, 2026

work page arXiv 2026
[60]

Gemma 4 model card

Google DeepMind. Gemma 4 model card. 2026

2026
[61]

Grishina, M

E. Grishina, M. Smirnov, and M. Rakhuba. Accelerating Newton-Schulz iteration for orthogonalization via Chebyshev-type polynomials.arXiv preprint arXiv:2506.10935, 2025

work page arXiv 2025
[62]

Gu and T

A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. InProceedings of the Conference on Language Modeling (COLM). 2024

2024
[63]

Gu and Z

Y. Gu and Z. Xie. Mano: Restriking manifold optimization for LLM training.arXiv preprint arXiv:2601.23000, 2026

work page arXiv 2026
[64]

Gupta, T

V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. In Proceedings of the International Conference on Machine Learning (ICML). 2018

2018
[65]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2016

2016
[66]

N. J. Higham. Computing the polar decomposition—with applications.SIAM Journal on Scientific and Statistical Computing, 7(4):1160–1174, 1986

1986
[67]

N. J. Higham. Stable iterations for the matrix square root.Numerical Algorithms, 15(2):227–242, 1997

1997
[68]

N. J. Higham.Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, 2008

2008
[69]

Hoffmann, S

J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de las Casas, L. A. Hendricks, J. Welbl, et al. Training compute-optimal large language models. InAdvances in Neural Information Processing Systems (NeurIPS). 2022

2022
[70]

R. A. Horn and C. R. Johnson.Topics in Matrix Analysis. Cambridge University Press, 1994

1994
[71]

R. A. Horn and C. R. Johnson.Matrix Analysis. Cambridge University Press, 2nd edition, 2012

2012
[72]

Y. Hu, H. Song, J. Deng, J. Wang, J. Chen, K. Zhou, Y. Zhu, J. Jiang, Z. Dong, et al. YuLan-Mini: Pushing the limits of open data-efficient language model. InProceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (Volume 1: Long Papers). 2025

2025
[73]

LiMuon: Light and Fast Muon Optimizer for Large Models

F. Huang, Y. Luo, and S. Chen. LiMuon: Light and fast Muon optimizer for large models.arXiv preprint arXiv:2509.14562, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[74]

A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[75]

R. Jiang, Z. Mhammedi, M. Mohri, and A. Mokhtari. Adaptive matrix online learning through smoothing with guarantees for nonsmooth nonconvex optimization. arXiv preprint arXiv:2602.08232, 2026

[76]

Z. Jiang, J. Gu, H. Zhu, and D. Pan. Pre-RMSNorm and Pre-CRMSNorm transformers: equivalent and efficient Pre-LN transformers. In Advances in Neural Information Processing Systems (NeurIPS). 2023

[77]

K. Jordan, J. Bernstein, B. Rappazzo, @fernbear.bsky.social, B. Vlado, Y. Jiacheng, F. Cesista, B. Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the NanoGPT baseline. 2024

[78]

K. Jordan, Y. Jin, V. Boza, Y. Jiacheng, F. Cecista, L. Newhouse, and J. Bernstein. Muon: An optimizer for hidden layers in neural networks. 2024

[79]

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, et al. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

[80]

P. Kasimbeg, F. Schneider, R. Eschenhagen, J. Bae, C. S. Sastry, M. Saroufim, B. Feng, L. Wright, E. Z. Yang, et al. Accelerating neural network training: An analysis of the AlgoPerf competition. In International Conference on Learning Representations (ICLR). 2025

Showing first 80 references.
