Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Tim Tsz-Kit Lau; Weijie Su

arXiv: 2605.18106 · v1 · pith:AGW62PWNnew · submitted 2026-05-18 · 🧮 math.OC · cs.AI · cs.LG · stat.ML

Pith reviewed 2026-05-20 09:30 UTC · model grok-4.3

keywords: symmetry-compatible optimizer · equivariant updates · embedding matrices · MoE routers · SwiGLU projections · language model pretraining · permutation symmetry · AdamW alternatives

The pith

Gradient updates should be equivariant under the symmetry group of each weight block to improve training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a symmetry-compatible principle stating that the gradient update rule must respect the equivariance properties of the parameter space in neural networks. This leads to tailored optimizers for blocks with permutation symmetries, such as embeddings and LM heads, and shared-shift symmetries, such as MoE routers, along with hybrid rules for SwiGLU projections. The authors assemble these into a full layerwise optimizer stack and test it in pre-training of both dense and sparse MoE language models. Experiments show consistent reductions in final validation loss compared with AdamW, plus stability gains in some runs. A sympathetic reader would see this as closing a long-standing mismatch between architectural geometry and coordinate-wise optimization practice.

Core claim

The central claim is that the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block, yielding one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates for embeddings, LM heads, SwiGLU MLPs, and MoE routers that together outperform standard AdamW updates.

What carries the argument

The symmetry-compatible principle, requiring that the update rule be equivariant under the symmetry group of each weight block.
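Stated in symbols, and in notation that is an assumption of this note rather than taken from the paper: let $G$ be the gradient of a weight block, let $\Phi$ map a gradient to an update direction, and let a symmetry group $\mathcal{S}$ act on the block. Equivariance asks that transforming the gradient and then computing the update gives the same result as computing the update and then transforming it.

```latex
\[
  \Phi(s \cdot G) \;=\; s \cdot \Phi(G) \qquad \text{for all } s \in \mathcal{S}.
\]
% Illustrative instance (an assumed action, for general matrix layers):
% s = (P, Q) with P, Q orthogonal and s \cdot G = P G Q^{\top},
% so the condition reads \Phi(P G Q^{\top}) = P \, \Phi(G) \, Q^{\top}.
```

Changing the group $\mathcal{S}$ (for example, from orthogonal matrices to permutations) changes which maps $\Phi$ satisfy the condition, which is the sense in which different blocks receive different update rules.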

If this is right

Symmetry-compatible updates improve final validation loss over AdamW across dense and MoE language models.
The updates also improve training stability in several of the reported experiments.
The constructions form a complete end-to-end layerwise optimizer stack that assigns each major matrix parameter class an update matching its symmetry group.
The new rules include row-norm, one-sided spectral, and hybrid variants derived from permutation and shared-shift symmetries.
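As a rough illustration of what equivariance means for a row-norm rule, the NumPy sketch below checks the condition numerically. It assumes a group action that permutes rows (vocabulary entries) and rotates columns (hidden dimension); that choice, the function name `row_norm_update`, and the use of `np.sign` as a stand-in for a coordinate-wise rule are assumptions made for this example, not the paper's definitions.

```python
import numpy as np

def row_norm_update(grad, eps=1e-12):
    """Illustrative row-norm rule: scale every row of grad to unit Euclidean norm."""
    norms = np.linalg.norm(grad, axis=1, keepdims=True)
    return grad / (norms + eps)

rng = np.random.default_rng(0)
vocab, dim = 6, 4
grad = rng.standard_normal((vocab, dim))

# Assumed symmetry action: permute rows with P, rotate columns with orthogonal Q.
P = np.eye(vocab)[rng.permutation(vocab)]
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))

# Row-norm commutes with the action because row norms survive the rotation.
row_norm_equivariant = np.allclose(
    row_norm_update(P @ grad @ Q.T), P @ row_norm_update(grad) @ Q.T
)

# A coordinate-wise sign map (stand-in only) does not commute with the rotation.
sign_equivariant = np.allclose(
    np.sign(P @ grad @ Q.T), P @ np.sign(grad) @ Q.T
)

print(row_norm_equivariant, sign_equivariant)
```

Note that the sign map does commute with the row permutation on its own; in this sketch it is the column rotation that breaks it.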

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same principle could be applied to other layer types whose symmetries have not yet been catalogued.
Architectures might eventually be co-designed so that their parameter symmetries align more cleanly with efficient equivariant updates.
Checking whether the loss gains persist or reverse when models are scaled by another order of magnitude would test the robustness of the reported improvements.

Load-bearing premise

That enforcing equivariance under the identified symmetry groups for these parameter blocks is sufficient to produce better optimizers without hidden costs that only appear at larger scale.

What would settle it

A controlled pre-training run on any of the tested model families in which the symmetry-compatible stack produces equal or higher validation loss than AdamW at the same compute budget would falsify the practical advantage.

Figures

Figures reproduced from arXiv: 2605.18106 by Tim Tsz-Kit Lau, Weijie Su.

Figure 2: Validation losses for downsized gpt-oss pre-training. The configurations differ in the opti…

Figure 3: Training and validation losses for Qwen3-0.6B-style pre-training. In each subfigure, the…

Figure 4: Training and validation losses for Gemma 3 1B-style pre-training. In each subfigure, the…

Figure 5: Training and validation losses for OLMoE-1B-7B-style pre-training. The configurations differ in the optimizers assigned to the embedding, LM head, and router matrices. The final validation losses for configurations (i)–(iv) are 4.0814, 4.0717, 4.1083, and 4.1155 respectively. As shown in…

Figure 6: Training and validation losses for downsized gpt-oss pre-training. The configurations…

Original abstract

A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible updates consistently improve final validation loss, and in several cases training stability, over corresponding AdamW updates.
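To make "bi-orthogonally equivariant" concrete, the sketch below computes the orthogonal polar factor of a gradient through an exact SVD and checks that it commutes with orthogonal transformations on both sides. This is a minimal sketch: the exact SVD stands in for whatever iterative approximation a practical optimizer would use, and `polar_factor` is a hypothetical helper name.

```python
import numpy as np

def polar_factor(grad):
    """Orthogonal polar factor U @ V^T from the thin SVD grad = U diag(s) V^T."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(1)
m, n = 5, 3
grad = rng.standard_normal((m, n))  # full column rank almost surely

# Random orthogonal matrices acting on the left and right.
P, _ = np.linalg.qr(rng.standard_normal((m, m)))
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Equivariance: transform-then-update equals update-then-transform.
polar_equivariant = np.allclose(
    polar_factor(P @ grad @ Q.T), P @ polar_factor(grad) @ Q.T
)

# The update direction has all singular values equal to one.
singular_values = np.linalg.svd(polar_factor(grad), compute_uv=False)

print(polar_equivariant, singular_values)
```

The check relies on the gradient having full rank, in which case the polar factor is unique; for rank-deficient gradients the SVD-based factor is not uniquely determined and the comparison can fail.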

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The symmetry principle gives a clean way to build equivariant updates for embeddings and routers, but the gains over AdamW could easily trace to unmatched step sizes rather than the equivariance itself.

The main takeaway is that the authors extend spectral-style ideas to permutation symmetries in embeddings and shared-shift symmetries in MoE routers, producing concrete updates such as row-norm, one-sided spectral, and hybrid versions that differ from plain AdamW. They assemble these into a full layerwise stack and report better validation loss on a handful of small pre-training runs (Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss). The direction is promising, and the constructions are explicit enough to implement.

Referee Report

2 major / 2 minor

Summary. The paper proposes a symmetry-compatible principle for optimizer design: gradient updates should be equivariant under the symmetry group acting on each weight block. It unifies bi-orthogonally equivariant methods for general matrix layers and derives new updates (row-norm, one-sided spectral, hybrid row-norm/spectral, centered row-norm, left-spectral, row-aware, column-aware) for blocks with permutation symmetries (embeddings, LM heads) and shared-shift symmetries (MoE routers), together with hybrid rules for SwiGLU projections. These are assembled into a layerwise optimizer stack and tested via pre-training runs on Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss models, where the symmetry-compatible updates are reported to yield lower final validation loss and occasional stability gains relative to AdamW.
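One plausible reading of "shared-shift symmetry" for a router, assumed here rather than confirmed by the text above, is that adding the same vector to every expert row of the router matrix leaves softmax routing probabilities unchanged. The sketch below checks that invariance and the resulting constraint on the gradient; the softmax router, the cross-entropy loss, and all names are hypothetical choices for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
num_experts, dim = 4, 8
W = rng.standard_normal((num_experts, dim))  # assumed router matrix, one row per expert
x = rng.standard_normal(dim)                 # a single token representation
c = rng.standard_normal(dim)                 # shift shared by every expert row

# Adding c to every row shifts all logits by the same scalar c @ x.
probs = softmax(W @ x)
probs_shifted = softmax((W + c) @ x)
shift_invariant = np.allclose(probs, probs_shifted)

# For a cross-entropy loss on the routing probabilities (hypothetical loss),
# dL/dW = outer(probs - target, x), so the rows of the gradient sum to zero.
target = np.eye(num_experts)[0]
grad_W = np.outer(probs - target, x)
column_sums_vanish = np.allclose(grad_W.sum(axis=0), 0.0)

print(shift_invariant, column_sums_vanish)
```

Under this reading, the gradient already lies in the subspace orthogonal to the shared shift, which is one way to motivate centering across experts before normalizing.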

Significance. If the performance gains are shown to arise specifically from the enforced equivariance (rather than incidental rescaling), the principle supplies a systematic route to architecture-aware optimizers that respect the permutation and shift structures already present in modern LLM components. The multi-architecture experimental sweep and the unified treatment of bi-orthogonal updates constitute concrete strengths that would remain valuable even if the central empirical claim requires additional controls.

major comments (2)

[Experimental section] The manuscript reports consistent validation-loss improvements for the symmetry-compatible updates (row-norm, one-sided spectral, hybrid, centered row-norm, left-spectral) over AdamW across the Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss runs, yet provides no per-method learning-rate sweeps or explicit matching of mean update norms / effective step lengths. Because the proposed rules embed explicit row/column normalizations and spectral projections absent from coordinate-wise AdamW, the observed gains could be produced by an implicit change in update magnitude rather than by permutation or shared-shift equivariance. This control is load-bearing for the central claim.
[§3 (Derivations of symmetry-compatible updates)] The principle is introduced as an independent design axiom whose justification rests on external validation by the loss curves. It would strengthen the manuscript to demonstrate, even for a single block, that the derived equivariant update cannot be recovered from AdamW by a simple global rescaling or reparameterization of the learning rate.
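The norm-matching control requested in the first comment could be approximated along the following lines. This is a sketch under stated assumptions: `rms` and `match_rms` are hypothetical helpers, root-mean-square magnitude is one of several reasonable notions of "effective step length", and `np.sign` is only a stand-in for a coordinate-wise baseline update.

```python
import numpy as np

def rms(x):
    """Root-mean-square entry magnitude of an update matrix."""
    return float(np.sqrt(np.mean(x ** 2)))

def match_rms(update, reference, eps=1e-12):
    """Rescale `update` by one scalar so its RMS equals that of `reference`."""
    return update * (rms(reference) / (rms(update) + eps))

rng = np.random.default_rng(3)
grad = rng.standard_normal((6, 4))

baseline_update = np.sign(grad)  # stand-in coordinate-wise update, RMS = 1
row_update = grad / np.linalg.norm(grad, axis=1, keepdims=True)  # unit-norm rows

# With 4 columns and unit-norm rows, the row-norm update has RMS 1/sqrt(4) = 0.5,
# so at equal learning rate it takes a smaller step than the baseline.
matched_update = match_rms(row_update, baseline_update)

print(rms(baseline_update), rms(row_update), rms(matched_update))
```

After matching, the two updates differ only in direction, so any remaining difference in training behaviour cannot be attributed to overall step size alone.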

minor comments (2)

[Abstract and experimental summary] The claim of 'consistent loss improvements' and 'several cases' of improved stability would be more informative if accompanied by quantitative effect sizes, number of independent runs, and standard deviations.
[Notation and definitions] The precise definitions of 'one-sided spectral', 'hybrid row-norm/spectral', and 'left-spectral' updates should be stated with explicit matrix formulas in the main text rather than left to supplementary material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We agree that isolating the role of equivariance from possible rescaling effects is essential for the central claim, and we outline revisions below to address both major comments.

Referee: [Experimental section] The manuscript reports consistent validation-loss improvements for the symmetry-compatible updates (row-norm, one-sided spectral, hybrid, centered row-norm, left-spectral) over AdamW across the Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss runs, yet provides no per-method learning-rate sweeps or explicit matching of mean update norms / effective step lengths. Because the proposed rules embed explicit row/column normalizations and spectral projections absent from coordinate-wise AdamW, the observed gains could be produced by an implicit change in update magnitude rather than by permutation or shared-shift equivariance. This control is load-bearing for the central claim.

Authors: We acknowledge that the current experiments lack per-method learning-rate sweeps and explicit matching of mean update norms. The symmetry-compatible rules do introduce row/column normalizations and spectral projections that alter effective magnitudes in a structured manner, so the referee's concern is valid. In the revised manuscript we will add experiments that tune learning rates separately for each method and explicitly match mean update norms (or effective step lengths) to AdamW baselines. This will allow direct comparison under comparable update magnitudes and thereby strengthen the evidence that the gains arise from permutation and shared-shift equivariance.
Revision: yes

Referee: [§3 (Derivations of symmetry-compatible updates)] The principle is introduced as an independent design axiom whose justification rests on external validation by the loss curves. It would strengthen the manuscript to demonstrate, even for a single block, that the derived equivariant update cannot be recovered from AdamW by a simple global rescaling or reparameterization of the learning rate.

Authors: We agree that an explicit demonstration would reinforce the independence of the derived updates. For the embedding-matrix case we will add a short algebraic comparison in §3 (or an appendix) showing that the row-norm update scales each row by the inverse of its own gradient norm. This per-row, data-dependent scaling cannot be reproduced by any fixed global multiplier applied to an AdamW update, because the latter remains strictly coordinate-wise and lacks the row-wise aggregation required by permutation equivariance. A brief numerical counter-example on a small matrix will be included to illustrate that no single rescaling factor equates the two rules.
Revision: yes
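A counter-example of the kind promised in this response might look like the sketch below. It is an illustration, not the authors' construction: the 2×2 gradient is a toy value, weight decay is ignored, and `adam_first_step` is a hypothetical helper that computes only the first bias-corrected Adam step from zero-initialized moments, where the update is approximately the elementwise sign of the gradient.

```python
import numpy as np

def adam_first_step(grad, beta1=0.9, beta2=0.999, eps=1e-8):
    """First bias-corrected Adam direction from zero-initialized moments."""
    m_hat = ((1 - beta1) * grad) / (1 - beta1)
    v_hat = ((1 - beta2) * grad ** 2) / (1 - beta2)
    return m_hat / (np.sqrt(v_hat) + eps)

# Toy gradient: two rows pointing the same way but with very different magnitudes.
grad = np.array([[3.0, 4.0],
                 [0.6, 0.8]])

row_update = grad / np.linalg.norm(grad, axis=1, keepdims=True)
adam_update = adam_first_step(grad)

# If one global scalar c satisfied row_update = c * adam_update,
# this elementwise ratio would be constant across all entries.
ratio = row_update / adam_update
ratio_spread = ratio.max() - ratio.min()

# Best single scalar in the least-squares sense, and the residual it leaves.
best_scale = np.sum(row_update * adam_update) / np.sum(adam_update ** 2)
residual = np.linalg.norm(row_update - best_scale * adam_update)

print(ratio, best_scale, residual)
```

Here both rows map to the direction (0.6, 0.8) under the row-norm rule, while the Adam step is close to (1, 1) in every row, so the ratio varies between entries and no single rescaling factor reconciles the two.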

Circularity Check

0 steps flagged

Symmetry principle stated as independent axiom; derivation self-contained with external validation

The paper introduces the symmetry-compatible principle as a design axiom (gradient update equivariant under the symmetry group of each weight block) and derives specific update rules for embeddings, LM heads, SwiGLU MLPs, and MoE routers from that axiom plus standard equivariance requirements. No step reduces a claimed prediction or result to a fitted parameter or self-citation by construction. Experiments on Qwen3, Gemma, OLMoE, and gpt-oss models serve as independent validation rather than the source of the updates. The central claim remains independent of the reported loss curves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the single design axiom that updates must be equivariant under the symmetry group of each weight block, plus the empirical observation that the resulting rules outperform AdamW; no free parameters or new postulated entities are introduced.

axioms (1)

Domain assumption: The gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block.
This is the load-bearing design principle stated in the abstract and used to derive all subsequent update rules.

Showing first 80 references.

[1] E. Abbe and E. Boix-Adsera. On the non-universality of deep learning: quantifying the cost of symmetry. In Advances in Neural Information Processing Systems (NeurIPS). 2022

[2] K. Ahn, N. Amsel, and J. Langford. Dion2: A simple method to shrink matrix in Muon. arXiv preprint arXiv:2512.16928, 2025

[3] K. Ahn, B. Xu, N. Abreu, Y. Fan, G. Magakyan, P. Sharma, Z. Zhan, and J. Langford. Dion: Distributed orthonormalized updates. arXiv preprint arXiv:2504.05295, 2025

[4] J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 2023

[5] N. Amsel, D. Persson, C. Musco, and R. M. Gower. The Polar Express: Optimal matrix sign methods and their application to the Muon algorithm. In International Conference on Learning Representations (ICLR). 2026

[6] K. An, Y. Liu, R. Pan, S. Ma, D. Goldfarb, and T. Zhang. ASGO: Adaptive structured gradient optimization. In Advances in Neural Information Processing Systems (NeurIPS). 2025

[7] R. Anil, V. Gupta, T. Koren, K. Regan, and Y. Singer. Scalable second order optimization for deep learning. arXiv preprint arXiv:2002.09018, 2020

[8] Q. Anthony, Y. Tokpanov, S. Szot, S. Rajagopal, P. Medepalli, R. Iyer, V. Shyam, A. Golubeva, A. Chaurasia, et al. Training foundation models on a full-stack AMD platform: Compute, networking, and system design. arXiv preprint arXiv:2511.17127, 2025

[9] L. Autonne. Sur les groupes linéaires, réels et orthogonaux. Bulletin de la Société Mathématique de France, 30:121–134, 1902

[10] E. Bao, J. Lu, L. Song, N. Hart-Hodgson, W. Parson, and Y. Zhou. Equivariant neural networks and equivarification. arXiv preprint arXiv:1906.07172, 2019

[11] J. Bernstein. Deriving Muon. 2025

[12] J. Bernstein. Modular manifolds. Thinking Machines Lab: Connectionism, 2025. https://thinkingmachines.ai/blog/modular-manifolds/

[13] J. Bernstein and L. Newhouse. Old optimizer, new norm: An anthology. In OPT 2024: Optimization for Machine Learning. 2024

[14] J. Bernstein and L. Newhouse. Modular duality in deep learning. In Proceedings of the International Conference on Machine Learning (ICML). 2025

[15] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD: Compressed optimisation for non-convex problems. In Proceedings of the International Conference on Machine Learning (ICML). 2018

[16] R. Bhatia. Matrix Analysis, volume 169. Springer Science & Business Media, 2013

[17] T. Boissin, T. Massena, F. Mamalet, and M. Serrurier. Turbo-Muon: Accelerating orthogonality-based optimization with pre-conditioning. arXiv preprint arXiv:2512.04632, 2025

[18] F. Bordes, R. Y. Pang, A. Ajay, A. C. Li, A. Bardes, S. Petryk, O. Mañas, Z. Lin, A. Mahmoud, et al. An introduction to vision-language modeling. arXiv preprint arXiv:2405.17247, 2024

[19] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS). 2020

[20] S. Buchanan. A faster manifold Muon with ADMM. https://sdbuchanan.com/blog/manifold-muon/, 2025

[21] D. Carlson, V. Cevher, and L. Carin. Stochastic spectral descent for restricted Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS). 2015

[22] D. Carlson, E. Collins, Y.-P. Hsieh, L. Carin, and V. Cevher. Preconditioned spectral descent for deep learning. In Advances in Neural Information Processing Systems (NeurIPS). 2015

[23] D. Carlson, Y.-P. Hsieh, E. Collins, L. Carin, and V. Cevher. Stochastic spectral descent for discrete graphical models. IEEE Journal of Selected Topics in Signal Processing, 10(2):296–311, 2016

[24] D. Chang, Y. Liu, and G. Yuan. On the convergence of Muon and beyond. arXiv preprint arXiv:2509.15816, 2025

[25] D. Chang, Q. Shi, L. Zhang, Y. Li, R. Zhang, Y. Lu, Y. Liu, and G. Yuan. MuonEq: Balancing before orthogonalization with lightweight equilibration. arXiv preprint arXiv:2603.28254, 2026

[26] L. Chen, J. Li, and Q. Liu. Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054, 2025

[27] Y. Chen, Y. Chi, J. Fan, and C. Ma. Spectral methods for data science: A statistical perspective. Foundations and Trends® in Machine Learning, 14(5):566–806, 2021

[28] M. Crawshaw, C. Modi, M. Liu, and R. M. Gower. An exploration of non-Euclidean gradient descent: Muon and its many variants. arXiv preprint arXiv:2510.09827, 2025

[29] G. E. Dahl, F. Schneider, Z. Nado, N. Agarwal, C. S. Sastry, P. Hennig, S. Medapati, R. Eschenhagen, P. Kasimbeg, et al. Benchmarking neural network training algorithms. arXiv preprint arXiv:2306.07179, 2023

[30] T. Dao and A. Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Proceedings of the International Conference on Machine Learning (ICML). 2024

[31] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier. Language modeling with gated convolutional networks. In Proceedings of the International Conference on Machine Learning (ICML). 2017

[32] D. Davis and D. Drusvyatskiy. When do spectral gradient updates help in deep learning? arXiv preprint arXiv:2512.04299, 2025

[33] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence. 2026

[34] DeepSeek-AI, A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024

[35] M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. Steiner, M. Caron, R. Geirhos, et al. Scaling vision transformers to 22 billion parameters. In Proceedings of the International Conference on Machine Learning (ICML). 2023

[36] S. Deng, Z. Ouyang, T. Pang, Z. Liu, R. Jin, S. Yu, and Y. Yang. RMNP: Row-momentum normalized preconditioning for scalable matrix-based optimization. arXiv preprint arXiv:2603.20527, 2026

[37] A. Dewulf, D. Pai, L. Yang, A. Zhang, and B. Keigwin. Aurora: A leverage-aware optimizer for rectangular matrices. 2026

[38] C. Ding, D. Sun, J. Sun, and K.-C. Toh. Spectral operators of matrices. Mathematical Programming, 168(1):509–531, 2018

[39] C. Ding, D. Sun, J. Sun, and K.-C. Toh. Spectral operators of matrices: Semismoothness and characterizations of the generalized Jacobian. SIAM Journal on Optimization, 30(1):630–659, 2020

[40] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR). 2021

[41] Z. Du, H. He, and W. Su. Uncovering symmetry transfer in large language models via layer-peeled optimization. arXiv preprint arXiv:2605.12756, 2026

[42] Z. Du and W. Su. The Newton–Muon optimizer. arXiv preprint arXiv:2604.01472, 2026

[43] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011

[44] R. Eschenhagen, A. Cai, T.-H. Lee, and H.-J. M. Shi. Clarifying Shampoo: Adapting spectral descent to stochasticity and the parameter trajectory. arXiv preprint arXiv:2602.09314, 2026

[45] R. Eschenhagen, A. Immer, R. Turner, F. Schneider, and P. Hennig. Kronecker-factored approximate curvature for modern neural network architectures. In Advances in Neural Information Processing Systems (NeurIPS). 2023

[46] Essential AI. Layer sharding for large-scale training with Muon. https://www.essential.ai/research/infra, 2025

[47] Essential AI, I. Shah, A. M. Polloreno, K. Stratos, P. Monk, A. Chaluvaraju, A. Hojel, A. Ma, A. Thomas, et al. Practical efficiency of Muon for pretraining. arXiv preprint arXiv:2505.02222, 2025

[48] O. Filatov, J. Wang, J. Ebert, and S. Kesselheim. Optimal scaling needs optimal norm. arXiv preprint arXiv:2510.03871, 2025

[49] K. Frans, S. Levine, and P. Abbeel. A stable whitening optimizer for efficient neural network training. In Advances in Neural Information Processing Systems (NeurIPS). 2025

[50] Gemma Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025

[51] A. Glentis, J. Li, A. Han, and M. Hong. A minimalist optimizer design for LLM pretraining. arXiv preprint arXiv:2506.16659, 2025

[52] GLM-4.5 Team, A. Zeng, X. Lv, Q. Zheng, Z. Hou, B. Chen, C. Xie, C. Wang, D. Yin, et al. GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models. arXiv preprint arXiv:2508.06471, 2025

[53] GLM-5 Team, A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, et al. GLM-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026

[54] D. Goldfarb, Y. Ren, and A. Bahamou. Practical quasi-Newton methods for training deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS). 2020

[55] S. Gong, S. Agarwal, Y. Zhang, J. Ye, L. Zheng, M. Li, C. An, P. Zhao, W. Bi, et al. Scaling diffusion language models via adaptation from autoregressive models. In International Conference on Learning Representations (ICLR). 2025

[56] W. Gong, J. Zazo, Q. Luo, P. Wang, J. Hensman, and C. Ma. ARO: A new lens on matrix optimization for large models. arXiv preprint arXiv:2602.09006, 2026

[57] A. Gonon, A.-A. Muşat, and N. Boumal. Insights on Muon from simple quadratics. arXiv preprint arXiv:2602.11948, 2026

[58] Google DeepMind. Gemma 4 model card. 2026

[59] E. Grishina, M. Smirnov, and M. Rakhuba. Accelerating Newton-Schulz iteration for orthogonalization via Chebyshev-type polynomials. arXiv preprint arXiv:2506.10935, 2025

[60] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the Conference on Language Modeling (COLM). 2024

[61] Y. Gu and Z. Xie. Mano: Restriking manifold optimization for LLM training. arXiv preprint arXiv:2601.23000, 2026

[62] V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. In Proceedings of the International Conference on Machine Learning (ICML). 2018

[63] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2016

[64] N. J. Higham. Computing the polar decomposition—with applications. SIAM Journal on Scientific and Statistical Computing, 7(4):1160–1174, 1986

[65] N. J. Higham. Stable iterations for the matrix square root. Numerical Algorithms, 15(2):227–242, 1997

[66] N. J. Higham. Functions of Matrices: Theory and Computation. Society for Industrial and Applied Mathematics, 2008

[67] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de las Casas, L. A. Hendricks, J. Welbl, et al. Training compute-optimal large language models. In Advances in Neural Information Processing Systems (NeurIPS). 2022

[68] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1994

[69] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 2nd edition, 2012

[70] Y. Hu, H. Song, J. Deng, J. Wang, J. Chen, K. Zhou, Y. Zhu, J. Jiang, Z. Dong, et al. YuLan-Mini: Pushing the limits of open data-efficient language model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (Volume 1: Long Papers). 2025

[71] F. Huang, Y. Luo, and S. Chen. LiMuon: Light and fast Muon optimizer for large models. arXiv preprint arXiv:2509.14562, 2025

[72] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024
