Orth-Dion: Eliminating Geometric Mismatch in Distributed Low-Rank Spectral Optimization

Ansh Tiwari; Ganesh Talluri; Guillaume Rabusseau; Hideyuki Kawashima; Hiroki Naganuma; Ioannis Mitliagkas; Laura Gomezjurado Gonzalez; Tatsuhiro Nakamori

arxiv: 2605.16341 · v1 · pith:W4COAQUFnew · submitted 2026-05-07 · 💻 cs.LG

Tatsuhiro Nakamori, Laura Gomezjurado Gonzalez, Ganesh Talluri, Ansh Tiwari, Hideyuki Kawashima, Ioannis Mitliagkas, Guillaume Rabusseau, Hiroki Naganuma

Pith reviewed 2026-05-20 22:19 UTC · model grok-4.3

classification 💻 cs.LG

keywords: low-rank gradient compression · distributed optimization · spectral methods · error feedback · non-Euclidean smoothness · QR orthogonalization · Muon approximation · convergence rate

The pith

Orth-Dion replaces column normalization with QR orthogonalization to eliminate geometric mismatch in low-rank spectral optimization and restore optimal convergence rates at Dion's communication cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that Dion approximates Muon via one power iteration step followed by column normalization, but this normalization fails to produce the rank-r polar factor that satisfies the dual-norm constraint of low-rank spectral geometry. The resulting direction introduces an extra sqrt(r) factor into the convergence rate, the smoothness term, and the error-feedback recursion even when the low-rank gradient approximation itself is accurate. Orth-Dion corrects the mismatch by applying QR orthogonalization to the right factor instead. Under non-Euclidean smoothness with curvature constant L_r along rank-r directions, Orth-Dion achieves the rate O(sqrt(L_r/T)) that matches exact spectral methods while keeping the same per-step communication cost as Dion. The analysis uses a self-consistent fixed-point argument to remove the usual bounded-drift assumption and relies on time-averaged contraction of the error sequence rather than step-by-step contraction.

Core claim

By replacing column normalization with QR orthogonalization of the right factor, Orth-Dion ensures the low-rank direction satisfies the dual-norm constraint of the rank-r spectral geometry. This removes the sqrt(r) penalty from the rate, the smoothness term, and the error-feedback analysis. The method attains O(sqrt(L_r/T)) convergence under non-Euclidean smoothness, with L_r the curvature constant along rank-r directions, at the same per-step communication cost as Dion. The proof removes the bounded-drift assumption common in prior error-feedback analyses via a self-consistent fixed-point argument and uses a time-averaged contraction that only requires the error sequence to contract on average rather than at every step.

What carries the argument

QR orthogonalization of the right factor in the rank-r gradient representation to recover the proper polar factor that satisfies the low-rank spectral dual-norm constraint.
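
As a rough illustration of where a factor of up to sqrt(r) can come from, the worked bound below compares the spectral norm of a right factor with unit-length columns against one with orthonormal columns. The symbols $Q$, $q_j$, and $n$ are notational assumptions made here, and the spectral norm is used only as a stand-in; the paper's rank-r dual norm may be defined differently.

```latex
% Q \in \mathbb{R}^{n \times r} with unit-length columns q_1, \dots, q_r
% (the result of column normalization):
\|Q\|_F^2 = \sum_{j=1}^{r} \|q_j\|_2^2 = r
\quad\Longrightarrow\quad
1 \le \|Q\|_2 \le \|Q\|_F = \sqrt{r}.

% Q with orthonormal columns (the Q factor of a QR decomposition):
Q^\top Q = I_r
\quad\Longrightarrow\quad
\|Q\|_2 = 1.
```

The upper bound sqrt(r) is attained when all columns coincide up to sign, which is the regime a dominant singular direction pushes column normalization toward.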

If this is right

The convergence rate matches that of exact spectral methods such as Muon without increasing communication volume.
The extra sqrt(r) factor disappears from both the smoothness term and the error-feedback recursion.
The self-consistent fixed-point argument removes the bounded-drift assumption required by prior analyses.
Time-averaged contraction suffices for the error sequence rather than requiring contraction at every step.
Experiments on language-model pre-training confirm the predicted sqrt(r) scaling and close the gap to Muon.
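
The difference between per-step and time-averaged contraction can be seen in a toy scalar recursion. The sketch below is not the paper's error-feedback recursion: the function `run_recursion`, the factors 1.2 and 0.5, and the constant drive term are all hypothetical choices made for illustration.

```python
import math

def run_recursion(rhos, drive=1.0, e0=0.0):
    """Iterate e_{t+1} = rho_t * e_t + drive and return the whole trajectory."""
    traj = [e0]
    for rho in rhos:
        traj.append(rho * traj[-1] + drive)
    return traj

# Factors alternate between expansion (1.2 > 1) and contraction (0.5 < 1),
# so the sequence does NOT contract at every step.
rhos = [1.2, 0.5] * 50

# Geometric mean of the factors: contraction holds only on average.
geo_mean = math.exp(sum(math.log(r) for r in rhos) / len(rhos))

traj = run_recursion(rhos)
print("geometric mean of factors:", geo_mean)
print("largest error along the trajectory:", max(traj))
```

Over two consecutive steps the map is e -> 0.6 e + 1.5, so the trajectory stays bounded even though every other step expands the error.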

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The time-averaged contraction technique may simplify proofs for other error-feedback schemes that only achieve average rather than per-step contraction.
Similar geometric mismatches may exist in other low-rank approximations that rely on normalization instead of orthogonalization.
Prioritizing exact recovery of the polar factor could improve rates in additional distributed compression settings beyond spectral methods.
The approach suggests that geometric fidelity to the target optimizer's constraint set is a first-order design choice for low-rank methods.

Load-bearing premise

Non-Euclidean smoothness holds with a well-defined finite curvature constant L_r along the rank-r directions of interest.
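
For orientation, smoothness with respect to a general norm is commonly written in the form below. This is a generic template, not the paper's exact assumption: the symbols $f$, $\Delta$, $\eta$, $D$, and $\nu$ are notational assumptions, and restricting $\Delta$ to rank at most $r$ is one plausible reading of "along rank-r directions".

```latex
% Smoothness of f with respect to a norm \|\cdot\|, constant L_r,
% for perturbations \Delta of rank at most r:
f(X + \Delta) \le f(X) + \langle \nabla f(X), \Delta \rangle
                + \frac{L_r}{2}\,\|\Delta\|^2 .

% One step X^+ = X - \eta D with a direction of norm \|D\| = \nu:
f(X^+) \le f(X) - \eta\,\langle \nabla f(X), D \rangle
           + \frac{L_r}{2}\,\eta^2 \nu^2 .
```

Under this template a direction with $\nu = 1$ pays $L_r \eta^2 / 2$ per step, while a direction with $\nu > 1$ inflates the quadratic term by $\nu^2$, which matches the Figure 2 caption's description of $\bar{\nu}_t^2$ as the factor entering the smoothness term.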

What would settle it

A large-scale training run in which switching from column normalization to QR orthogonalization leaves the observed convergence gap to full Muon unchanged or retains the sqrt(r) slowdown in measured wall-clock or step-wise progress.

Figures

Figures reproduced from arXiv: 2605.16341 by Ansh Tiwari, Ganesh Talluri, Guillaume Rabusseau, Hideyuki Kawashima, Hiroki Naganuma, Ioannis Mitliagkas, Laura Gomezjurado Gonzalez, Tatsuhiro Nakamori.

**Figure 1.** Proposed methods improve both convergence and wall-clock time. (a) At matched rank, Orth-Dion and Ada-Orth-Dion reach Dion’s plateau earlier and keep going. (b) Adaptive rank absorbs Orth-Dion’s QR overhead and matches Dion’s per-step time on the 17.1B model. (Ada-Orth-Dion’s rank is pinned at $0.93\,r_f$, the steady-state rank reached on the 320M model; see App. L.2.) At first glance, the remaining gap between Dio…

**Figure 2.** The dual-norm mismatch is real, rank-ordered, and persistent (Llama 3 320M, $r \in \{96, 192, 384\}$, mean over 3 seeds; band ±std). (a) Dion’s $\bar{\nu}_t$ separates cleanly by rank and stays well above 1 throughout training; the right axis shows $\bar{\nu}_t^2$, the factor through which $\nu_t$ enters the smoothness term in (1). (b) Per-update $\nu_t$ pooled over layer, step, and seed: the inflation is dispersed, not transient. Orth-Dion …

**Figure 3.** Training dynamics for Llama 3 320M (6,100 steps, 8×GH200 FSDP). (a) Orth-Dion reaches Dion’s best loss at matched rank in ∼0.89× wall-clock time across $r \in \{96, 192, 384\}$. (b) At every rank, Orth-Dion lands at lower C4 validation loss than Dion, consistent with Lemma 4.1. (c) Ada-Orth-Dion (init $r = 384$) has $r \approx 357$ on average; the fixed-rank curves are flat by construction. On Llama 3 320M, the residual an…

Original abstract

Low-rank gradient compression reduces communication in distributed training by representing updates with rank-$r$ factors. Dion is a recent method that approximates Muon, a spectral optimizer that orthogonalizes momentum, using one step of power iteration followed by column normalization (rescaling each column of the right factor to unit length). This makes it compatible with fully sharded data parallel training, but it converges more slowly than full-rank spectral methods. We show that this gap is geometric: column normalization does not yield the rank-$r$ polar factor that Muon implicitly targets, so the resulting direction violates the dual-norm constraint of the low-rank spectral geometry, and the rate picks up an extra factor of $\sqrt{r}$ even though the low-rank approximation of the gradient itself is accurate. The same mismatch enters the smoothness term and the error-feedback recursion in the analysis, which has a knock-on effect on empirical performance. We propose Orth-Dion, which replaces column normalization with QR orthogonalization of the right factor. Under non-Euclidean smoothness, with $L_r$ the curvature constant along rank-$r$ directions, Orth-Dion attains rate $O(\sqrt{L_r/T})$, matching exact spectral methods at the same per-step communication cost as Dion. The proof removes the bounded-drift assumption common in prior error-feedback analyses via a self-consistent fixed-point argument, and uses a time-averaged contraction that only requires the error sequence to contract on average rather than at every step. Experiments on large-scale language model pre-training validate the predicted $\sqrt{r}$ scaling and show that Orth-Dion closes the convergence gap to Muon at Dion's communication cost.
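
The abstract's description of the two variants can be sketched in NumPy as follows. This is a simplified single-device illustration under stated assumptions, not the authors' implementation: the helper names `column_normalize` and `low_rank_direction`, the toy matrix `M`, and the random starting basis `Q0` are hypothetical, and momentum, error feedback, sharding, and step-size scaling are omitted.

```python
import numpy as np

def column_normalize(R, eps=1e-12):
    """Rescale each column of R to (approximately) unit Euclidean length."""
    return R / (np.linalg.norm(R, axis=0, keepdims=True) + eps)

def low_rank_direction(M, Q_prev, orthogonalize_right):
    """One power-iteration step on M (m x n) from a right basis Q_prev (n x r)."""
    P, _ = np.linalg.qr(M @ Q_prev)   # orthonormal left factor, m x r
    R = M.T @ P                       # unnormalized right factor, n x r
    if orthogonalize_right:
        Q, _ = np.linalg.qr(R)        # QR variant: orthonormal columns
    else:
        Q = column_normalize(R)       # column-normalization variant
    return P @ Q.T                    # rank-r update direction, m x n

rng = np.random.default_rng(0)
m, n, r = 64, 48, 8

# Toy "momentum" matrix (an assumption): one dominant direction plus noise.
u = rng.standard_normal((m, 1))
v = rng.standard_normal((n, 1))
M = 10.0 * (u @ v.T) + 0.1 * rng.standard_normal((m, n))
Q0 = rng.standard_normal((n, r))

D_col = low_rank_direction(M, Q0, orthogonalize_right=False)
D_qr = low_rank_direction(M, Q0, orthogonalize_right=True)

print("spectral norm, column normalization:", np.linalg.norm(D_col, 2))
print("spectral norm, QR orthogonalization:", np.linalg.norm(D_qr, 2))
print("sqrt(r):", np.sqrt(r))
```

Because the left factor has orthonormal columns, the spectral norm of the direction equals that of the right factor, so the QR variant gives a direction of unit spectral norm while the column-normalized one can range between 1 and sqrt(r).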

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Orth-Dion swaps column normalization for QR on the right factor to fix the polar geometry mismatch in Dion, and claims this removes the extra sqrt(r) factor in the rate under a rank-r non-Euclidean smoothness assumption. The paper derives the mismatch directly from the definition of the polar factor rather than from earlier citations, which keeps the circularity low. It also introduces a time-averaged contraction argument for error feedback that only needs average contraction instead of step-by-step bounds, and uses a self-consistent fixed-point step to drop the usual bounded-drift hypothesis. Experiments on language model pre-training are said to confirm the sqrt(r) scaling and close the gap to Muon at Dion's communication cost.

That combination is the concrete advance here. The geometric diagnosis is clear and the rate statement matches what exact spectral methods achieve at the same per-step cost. The analysis stays grounded in the low-rank setting without obvious fitting tricks.

The main soft spot is the non-Euclidean smoothness assumption with curvature L_r restricted to rank-r directions. It is not obvious how restrictive this is in practice or whether the fixed-point construction truly works for arbitrary average-contracting error sequences without reintroducing a comparable uniform bound. The abstract sketches the steps but leaves the precise definition of L_r and the fixed-point existence argument thin, so those pieces need checking.

This paper is aimed at people working on distributed low-rank compression for large-model training where communication is the bottleneck. A reader focused on spectral optimizers or error-feedback analysis will find the targeted fix and the new contraction argument useful. It deserves a serious referee because the idea is specific, the claimed improvement is measurable, and the geometric point is worth verifying even if the assumptions require tightening.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Orth-Dion, a low-rank spectral optimizer for distributed training that replaces Dion's column normalization with QR orthogonalization of the right factor. It identifies a geometric mismatch in Dion that violates the rank-r polar factor and dual-norm constraint, incurring an extra sqrt(r) factor in the rate. Under a non-Euclidean smoothness assumption with curvature constant L_r along rank-r directions, Orth-Dion is claimed to achieve O(sqrt(L_r/T)) convergence, matching exact spectral methods at Dion's per-step communication cost. The analysis removes the standard bounded-drift hypothesis on error feedback via a self-consistent fixed-point argument that relies only on time-averaged contraction of the error sequence. Experiments on large-scale language-model pre-training are said to confirm the sqrt(r) scaling and close the gap to Muon.

Significance. If the central claims hold, the work offers a practical and theoretically grounded improvement to communication-efficient distributed optimization by restoring the geometric properties of spectral methods without extra communication. The self-consistent fixed-point technique for error-feedback analysis, if rigorously established, could apply more broadly to analyses that currently rely on uniform drift bounds. The explicit identification of the geometric mismatch and the predicted scaling provide a clear, falsifiable contribution.

major comments (3)

[Analysis (rate derivation and smoothness assumption)] The non-Euclidean smoothness assumption with curvature constant L_r (invoked for the O(sqrt(L_r/T)) rate) lacks an explicit definition or construction of L_r in terms of the underlying loss; without this, it is difficult to verify whether the assumption is strictly weaker than standard Euclidean smoothness or whether it holds for typical deep-learning objectives along rank-r subspaces.
[Proof of convergence rate (fixed-point step)] The self-consistent fixed-point argument used to eliminate the bounded-drift hypothesis (via time-averaged contraction) does not visibly establish existence or uniqueness of the fixed point for arbitrary error sequences that merely contract on average. If the argument tacitly requires a uniform bound on the initial error or drift to guarantee the fixed point, the claimed removal of the bounded-drift assumption would be weakened to a condition of comparable strength.
[Algorithm description and complexity analysis] The claim that Orth-Dion matches the rate of exact spectral methods at identical per-step communication cost requires an explicit accounting of the communication volume of the QR step versus column normalization; the current description leaves open whether the orthogonalization introduces hidden factors that affect the overall complexity comparison.

minor comments (2)

[Abstract and §1] The abstract and introduction would benefit from a one-sentence reminder of the dual-norm constraint that the polar factor satisfies, to make the geometric-mismatch argument immediately accessible.
[Preliminaries] Notation for the right factor and its normalization/orthogonalization should be introduced with an equation (e.g., defining the column-normalized versus QR version) before the mismatch is discussed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the insightful comments on our manuscript. Below we address each major comment point by point, outlining the revisions we intend to make.

Referee: The non-Euclidean smoothness assumption with curvature constant L_r (invoked for the O(sqrt(L_r/T)) rate) lacks an explicit definition or construction of L_r in terms of the underlying loss; without this, it is difficult to verify whether the assumption is strictly weaker than standard Euclidean smoothness or whether it holds for typical deep-learning objectives along rank-r subspaces.

Authors: We agree that an explicit definition and construction of L_r is necessary for clarity. In the revised manuscript, we will introduce a precise definition: L_r is the curvature constant of the loss restricted to the manifold of rank-at-most-r matrices, defined as the supremum over all rank-r points X of the operator norm of the Hessian projected onto the tangent space at X. This assumption is strictly weaker than standard Euclidean smoothness with constant L, since L_r can be bounded independently of the full dimension when the loss landscape is approximately low-rank. We will also provide a short discussion on its validity for deep learning objectives, noting that in overparameterized models the effective curvature along low-rank directions is often much smaller than the global L. revision: yes

Referee: The self-consistent fixed-point argument used to eliminate the bounded-drift hypothesis (via time-averaged contraction) does not visibly establish existence or uniqueness of the fixed point for arbitrary error sequences that merely contract on average. If the argument tacitly requires a uniform bound on the initial error or drift to guarantee the fixed point, the claimed removal of the bounded-drift assumption would be weakened to a condition of comparable strength.

Authors: The referee correctly identifies a point that requires clarification in the proof. The current presentation relies on the time-averaged contraction to define the fixed point implicitly but does not detail the existence proof. We will revise the relevant section and appendix to rigorously establish existence by considering the sequence of Cesàro means and showing convergence to a fixed point of the averaged operator. Uniqueness follows from the strict contraction on average. Importantly, this construction does not require a uniform bound on the initial error or drift; it holds as long as the average contraction factor is strictly less than one, which is guaranteed by our assumptions. We will make these steps explicit in the revision. revision: yes

Referee: The claim that Orth-Dion matches the rate of exact spectral methods at identical per-step communication cost requires an explicit accounting of the communication volume of the QR step versus column normalization; the current description leaves open whether the orthogonalization introduces hidden factors that affect the overall complexity comparison.

Authors: We will add an explicit complexity analysis in the revised manuscript. The QR orthogonalization of the right factor (size model_dim x r) is performed entirely locally on each device and does not involve any additional inter-device communication. The communicated data per step consists of the same low-rank factors as in Dion: the left factor and the orthogonalized right factor. Column normalization in Dion is similarly a local rescaling operation. Thus, the per-step communication volume remains identical, specifically O((d + m) * r / P) where P is the number of workers, with no hidden factors introduced by QR. We will include a table or paragraph comparing the two methods' communication and computation costs. revision: yes
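
The scale of the communication saving described in the response above can be checked with simple arithmetic. The sketch below is a back-of-the-envelope count under stated assumptions: the function names and the layer shape 4096 x 4096 with rank 128 are hypothetical, and sharding across workers, data types, and the collective algorithm are ignored.

```python
def full_matrix_elems(rows, cols):
    """Number of entries in a full rows x cols update matrix."""
    return rows * cols

def low_rank_elems(rows, cols, rank):
    """Number of entries in a left (rows x rank) plus right (cols x rank) factor."""
    return (rows + cols) * rank

# Hypothetical layer shape and rank, chosen only for illustration.
rows, cols, rank = 4096, 4096, 128

full = full_matrix_elems(rows, cols)
low = low_rank_elems(rows, cols, rank)
print("full-matrix entries:", full)
print("low-rank factor entries:", low)
print("reduction factor:", full / low)
```

The count depends only on the factor shapes, so replacing a local column rescaling with a local QR step leaves it unchanged; the extra cost of QR is compute, not communication.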

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

The paper's central rate O(sqrt(L_r/T)) is obtained by direct analysis under the non-Euclidean smoothness assumption with curvature L_r along rank-r directions, using a self-consistent fixed-point argument on the error sequence and time-averaged contraction. This fixed-point construction is introduced in the proof and does not reduce by the paper's equations to a fitted quantity or prior self-citation. The geometric mismatch between column normalization and the rank-r polar factor is derived explicitly from the definition of the polar factor itself. No load-bearing step equates the claimed output to its inputs by construction, and the analysis remains independent of any self-citation chain or ansatz smuggled from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on non-Euclidean smoothness along rank-r directions and on the validity of the self-consistent fixed-point argument for removing the bounded-drift assumption; no free parameters or new invented entities are introduced.

axioms (2)

domain assumption: Non-Euclidean smoothness with a curvature constant L_r defined along rank-r directions.
Invoked to state the O(sqrt(L_r/T)) rate and to bound the smoothness term in the analysis.
ad hoc to paper: The error-feedback sequence admits a self-consistent fixed point that contracts on average.
Used to remove the bounded-drift assumption common in prior error-feedback analyses.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Under non-Euclidean smoothness, with L_r the curvature constant along rank-r directions, Orth-Dion attains rate O(√(L_r/T)) ... via a self-consistent fixed-point argument, and uses a time-averaged contraction"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "ν_t = ∥D̂_t∥_{(r)} ... column normalization can introduce a rank-dependent penalty"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 5 internal anchors

[1] Old Optimizer, New Norm: An Anthology. 2024. doi:10.48550/arXiv.2409.20325
[2] Preconditioned Spectral Descent for Deep Learning. Advances in Neural Information Processing Systems, 2015.
[3] Dion: Distributed Orthonormalized Updates. arXiv preprint arXiv:2504.05295.
[4] Dion2: A Simple Method to Shrink Matrix in Muon. arXiv preprint arXiv:2512.16928.
[5] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. Proceedings of the VLDB Endowment.
[6] MuonBP: Faster Muon via Block-Periodic Orthogonalization. arXiv preprint arXiv:2510.16981.
[7] Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.
[8] Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks.
[9] Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747.
[10] signSGD: Compressed optimisation for non-convex problems. International Conference on Machine Learning, 2018.
[11] AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning. International Conference on Learning Representations.
[12] LoRA: Low-Rank Adaptation of Large Language Models. International Conference on Learning Representations.
[13] The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.
[14] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research.
[15] Training Compute-Optimal Large Language Models. arXiv preprint arXiv:2203.15556.
[16] torchtitan: One-stop PyTorch native solution for production ready LLM pre-training.
[17] Adaptive Batch Sizes Using Non-Euclidean Gradient Noise Scales for Stochastic Sign and Spectral Descent. arXiv preprint arXiv:2602.03001.
[18] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech
[19] Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified
[20] Richt. Advances in Neural Information Processing Systems.
[21] PowerSGD: Practical low-rank gradient compression for distributed optimization. Advances in Neural Information Processing Systems.
[22] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic.
[23] Matrix Analysis. 2012.
[24] Shampoo: Preconditioned Stochastic Tensor Optimization. International Conference on Machine Learning, 2018.
[25] Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade.
[26] Modular Duality in Deep Learning. arXiv preprint arXiv:2410.21265.
[27] Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian.
[28] Vladislav Lialin, Sherin Muckatira, Namrata Shivagunde, and Anna Rumshisky. 2024.
[29] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, and Martin Jaggi. Error Feedback Fixes. 2019.
[30] Convergence Bound and Critical Batch Size of Muon Optimizer. arXiv preprint arXiv:2507.01598.
