Muon as a Residual Connection

Hao Huang

arxiv: 2607.01124 · v1 · pith:5GF5WM52new · submitted 2026-07-01 · 💻 cs.LG · cs.AI

Pith reviewed 2026-07-02 15:37 UTC · model grok-4.3

keywords: Muon optimizer · residual connections · representation preservation · neural network training · update orthogonalization · linear optimization · optimizer design

The pith

Muon's update orthogonalization acts like an implicit residual connection during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims Muon succeeds because its update orthogonalization functions like a built-in residual link: it reduces how closely the step follows the current gradient but keeps the layer's output representation more stable and more usable by layers further down the network. The author demonstrates this trade-off in controlled linear problems, where Muon produces features that match a local target more slowly yet allow downstream layers to reach better solutions. This view reframes optimizer choice as balancing one-step descent against chain-wide compatibility rather than maximizing local speed alone. A reader would care if the same mechanism explains why Muon outperforms standard methods in large-scale deep training.

Core claim

Muon can be understood as an implicit residual connection during training. Specifically, orthogonalizing the update can sacrifice some immediate gradient fidelity while improving representation preservation for downstream layers. In controlled linear optimization settings, Muon learns representations that are slower to fit a local target but easier for downstream layers to exploit.

What carries the argument

Orthogonalization of the update matrix, which reduces alignment with the instantaneous gradient in exchange for better preservation of the representation passed to subsequent layers.
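The loss of gradient alignment can be made concrete with a small numerical sketch. This is not the paper's code: it assumes exact orthogonalization through an SVD (Orth(G) = U V^T for G = U Σ V^T, matching the appendix excerpt quoted in the reference graph), whereas practical Muon implementations typically approximate this step iteratively. The helper names `orthogonalize` and `cosine` and the toy gradient are hypothetical.

```python
import numpy as np

def orthogonalize(G):
    """Return U @ V.T from the thin SVD G = U diag(s) V.T (assumed exact Orth)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def cosine(A, B):
    """Cosine similarity under the Frobenius inner product."""
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B)))

rng = np.random.default_rng(0)
# Hypothetical ill-conditioned gradient: columns scaled very differently.
G = rng.standard_normal((6, 4)) @ np.diag([5.0, 1.0, 0.2, 0.05])

O = orthogonalize(G)
singular_values = np.linalg.svd(O, compute_uv=False)  # every entry is 1 up to rounding
alignment = cosine(G, O)                              # strictly between 0 and 1 here
```

Because the orthogonalized update sets every singular value to one, its Frobenius cosine with G equals the nuclear norm of G divided by the product of sqrt(rank) and the Frobenius norm of G. That ratio falls below one whenever the singular values of G are unequal, which is one way to quantify the "sacrificed gradient fidelity".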

If this is right

Muon produces representations that downstream layers can exploit more readily than those from standard gradient steps.
The method deliberately accepts slower local target fitting in return for improved compatibility across the full depth.
Optimizer design can be guided by explicitly trading local descent speed against representation stability for later layers.
The residual-like effect appears in linear settings and is proposed as the source of Muon's empirical behavior in deep networks.
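A toy version of the kind of linear experiment these points refer to might look like the sketch below. Everything specific here is an assumption rather than the paper's setup: the two-layer linear model W2 W1 X, the squared loss, the dimensions, the learning rates, the alternating W1-only and W2-only segments of length `tau` (loosely following the Figure 1 caption), and the use of an SVD-based orthogonalized gradient without momentum as a stand-in for Muon.

```python
import numpy as np

def orthogonalize(G):
    """Exact orthogonalization via thin SVD (stand-in for Muon's update map)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def loss_and_grads(W1, W2, X, Y):
    """Mean squared composition loss and its gradients for both layers."""
    n = X.shape[1]
    H = W1 @ X                 # hidden representation consumed by W2
    R = W2 @ H - Y             # residual of the composed map
    loss = 0.5 * np.sum(R ** 2) / n
    return loss, W2.T @ R @ X.T / n, R @ H.T / n

def train(w1_rule, tau=20, steps=200, lr_sgd=0.02, lr_muon=0.02, seed=0):
    """Alternate W1-only and W2-only segments; W2 always uses plain SGD."""
    rng = np.random.default_rng(seed)
    d_in, d_hid, d_out, n = 8, 8, 4, 64          # hypothetical sizes
    X = rng.standard_normal((d_in, n))
    Y = (rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)) @ X
    W1 = 0.3 * rng.standard_normal((d_hid, d_in))
    W2 = 0.3 * rng.standard_normal((d_out, d_hid))
    losses = []
    for t in range(steps):
        loss, gW1, gW2 = loss_and_grads(W1, W2, X, Y)
        losses.append(loss)                      # loss before this step's update
        if (t // tau) % 2 == 0:                  # W1-only segment
            if w1_rule == "muon":
                W1 = W1 - lr_muon * orthogonalize(gW1)
            else:
                W1 = W1 - lr_sgd * gW1
        else:                                    # W2-only segment
            W2 = W2 - lr_sgd * gW2
    return np.array(losses)

sgd_losses = train("sgd")
muon_losses = train("muon")
loss_gap = sgd_losses - muon_losses              # positive where the Muon path is ahead
```

Whether this toy reproduces the qualitative behaviour described above depends on the assumed sizes, step sizes, and schedule; it is meant only to show how the comparison could be set up.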

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same orthogonalization idea could be inserted into other first-order methods to test whether they gain similar depth-wise benefits.
If the mechanism holds, networks with very long chains of layers should show the largest relative gains from Muon-style updates.
A direct test would replace Muon's update rule with plain gradient descent inside an otherwise identical architecture and check whether downstream-layer performance drops.
This framing points toward optimizers that optimize an objective spanning multiple layers rather than a single-step loss.

Load-bearing premise

The trade-off measured in simple linear models is what produces Muon's observed gains when the same rule is used inside large nonlinear networks.

What would settle it

Training a deep network with a version of Muon that skips the orthogonalization step and measuring whether its final accuracy or convergence speed becomes indistinguishable from Adam or SGD would test the claim.
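The ablation amounts to a single switch in the update rule. The sketch below is a simplified, hypothetical Muon-style step, not the reference implementation: it uses a plain momentum buffer and an exact SVD-based orthogonalization, and it omits the iterative approximation and additional scaling that practical implementations typically use. The function name `muon_like_step` and its default hyperparameters are assumptions.

```python
import numpy as np

def muon_like_step(W, G, M, lr=0.02, beta=0.95, orthogonalize=True):
    """One simplified Muon-style update; returns (new_W, new_M)."""
    M = beta * M + G                       # momentum buffer
    if orthogonalize:
        U, _, Vt = np.linalg.svd(M, full_matrices=False)
        update = U @ Vt                    # all singular values set to 1
    else:
        update = M                         # ablation: momentum SGD direction
    return W - lr * update, M

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))
G = rng.standard_normal((5, 3))
M0 = np.zeros_like(W)

W_muon, _ = muon_like_step(W, G, M0, orthogonalize=True)
W_ablate, _ = muon_like_step(W, G, M0, orthogonalize=False)
```

With `orthogonalize=False` the rule reduces to SGD with momentum, so running both variants inside the same training loop isolates the effect of the orthogonalization step.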

Figures

Figures reproduced from arXiv: 2607.01124 by Hao Huang.

**Figure 1.** Composition loss under the τ schedule. Top: full trajectories on a logarithmic scale. Bottom: early-step zoom on a linear scale. Solid curves use SGD for $W_1$, while dashed curves use Muon for $W_1$; $W_2$ is always trained with SGD. Colors indicate which layer is updated at each step: joint updates in green, $W_1$-only segments in blue, and $W_2$-only segments in red.

**Figure 2.** Loss gap $L_{\mathrm{SGD}} - L_{\mathrm{Muon}}$ across the full τ-schedule sweep. Negative values mean the SGD path has lower loss at the same step, while positive values mean the Muon path is ahead. The colored segments mark which layer is active

**Figure 3.** Early-step zoom of the loss gap in Figure 2. The zoom highlights the initial local-optimization advantage of SGD on $W_1$ segments before Muon's representation advantage propagates through later $W_2$ updates.

**Figure 4.** Mechanism diagnostic for τ = 200. Top: spectral flatness of $W_2$ at the start of each $W_1$ segment (higher means a flatter singular spectrum). The Muon path tends to leave a flatter downstream layer before each blue segment. Bottom: loss gap $L_{\mathrm{SGD}} - L_{\mathrm{Muon}}$; blue portions are $W_1$-only segments and pale red portions are $W_2$-only segments. Negative gap means SGD has lower loss. The orange dashed replay, computed at e…
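The caption says only that higher spectral flatness means a flatter singular spectrum; the exact definition is not given in this excerpt. One plausible reading, borrowed from the signal-processing flatness measure and offered purely as an assumption, is the ratio of the geometric mean to the arithmetic mean of the singular values. The function name `spectral_flatness` and the `eps` floor are hypothetical.

```python
import numpy as np

def spectral_flatness(W, eps=1e-12):
    """Geometric mean / arithmetic mean of singular values (assumed definition)."""
    s = np.linalg.svd(W, compute_uv=False)
    s = np.maximum(s, eps)                 # guard the log against exact zeros
    return float(np.exp(np.mean(np.log(s))) / np.mean(s))

flat = spectral_flatness(np.eye(4))                            # equal singular values
peaked = spectral_flatness(np.diag([10.0, 1.0, 0.1, 0.01]))   # widely spread spectrum
```

Under this definition the value lies in (0, 1] and reaches 1 only when all singular values are equal, so a matrix with a widely spread spectrum scores lower than an orthogonal one.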

Original abstract

Muon has recently emerged as one of the most effective optimizers for training large neural networks, yet its empirical success has been explained from several different perspectives. In this paper, we propose a simple mechanistic interpretation: Muon can be understood as an implicit residual connection during training. Specifically, orthogonalizing the update can sacrifice some immediate gradient fidelity while improving representation preservation for downstream layers. We study this trade-off in controlled linear optimization settings, where Muon can learn representations that are slower to fit a local target but easier for downstream layers to exploit. Our results suggest a conceptual explanation for Muon and a design perspective for optimizers that balance local descent with downstream usability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper offers a simple interpretive framing of Muon as an implicit residual connection based on linear optimization trade-offs, but supplies no results or deep-network checks to back it.

The main point is that Muon can be viewed as trading some immediate gradient accuracy for better representation preservation downstream by orthogonalizing updates, with the idea tested in linear settings.

The new element is the explicit link to residual connections as a design lens. It does a clean job of spelling out why an optimizer might care about downstream usability rather than pure local descent, and it keeps the explanation short and tied to a familiar concept.

The soft spots are clear and central. The abstract states that the trade-off was studied in linear optimization settings, yet it gives no equations, no quantitative results, no error bars, and no criteria for the experiments. That makes it impossible to judge how real or strong the effect is. The bigger gap is the jump to deep non-linear networks: Muon’s reported success is in large models, but nothing here shows the same mechanism operates there or explains the gains. The linear-to-deep extrapolation is the least secure step.

This paper is for readers who follow optimizer interpretations in the optimization corner of ML. Someone after new methods, formal derivations, or reproducible measurements will not find them. The thinking is straightforward on its own terms, but the evidence does not support the central claim at the level needed for a serious explanation.

I would not bring this to a reading group or cite it. It does not look ready for peer review without substantial added experiments that close the linear-to-deep gap.

Referee Report

1 major / 0 minor

Summary. The paper claims that Muon can be interpreted as an implicit residual connection during training: orthogonalizing the update sacrifices some immediate gradient fidelity but improves representation preservation for downstream layers. This trade-off is examined in controlled linear optimization settings, where Muon learns representations that fit a local target more slowly but are easier for downstream layers to exploit, offering a conceptual explanation for Muon's empirical success in large neural networks and a design perspective for optimizers.

Significance. If the proposed mechanism generalizes, the work supplies a mechanistic account of an effective optimizer and a concrete design principle (balancing local descent against downstream usability) that could guide future optimizer development. The controlled linear studies constitute a clear, falsifiable starting point for the interpretation.

major comments (1)

[Abstract] The central claim that the observed linear trade-off explains Muon's success in large neural networks rests on an untested extrapolation; the manuscript reports no experiments, ablations, or analysis in non-convex, multi-layer, non-linear regimes, leaving the load-bearing step from linear optimization to deep-network dynamics unsupported.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment. We agree that the manuscript's scope is limited to linear settings and will revise the abstract to clarify that our results provide a conceptual hypothesis rather than a validated explanation for non-linear deep networks.

Point-by-point responses

Referee: [Abstract] The central claim that the observed linear trade-off explains Muon's success in large neural networks rests on an untested extrapolation; the manuscript reports no experiments, ablations, or analysis in non-convex, multi-layer, non-linear regimes, leaving the load-bearing step from linear optimization to deep-network dynamics unsupported.

Authors: We agree that the extrapolation from linear to deep non-linear networks is untested in the manuscript. The paper explicitly studies the trade-off only in controlled linear optimization settings and positions the work as offering a conceptual explanation and design perspective, not as a direct mechanistic account of large-network training. To address this, we will revise the abstract (and any similar phrasing in the introduction) to emphasize the linear scope and to frame the connection to Muon's empirical success as a hypothesis motivated by the linear results rather than a demonstrated explanation.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; interpretation derived from linear studies without reduction to inputs

full rationale

The paper advances a mechanistic interpretation of Muon as an implicit residual connection, supported by trade-off observations in controlled linear optimization settings. No equations, fitted parameters, or self-citations are shown that would make any claim equivalent to its inputs by construction. The central claim is presented as arising from the described empirical studies rather than a self-definitional loop, fitted prediction, or imported uniqueness theorem. The derivation chain remains self-contained as an interpretive proposal without load-bearing reductions to prior fits or definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the claim rests on the unstated premise that linear-setting behavior generalizes.

Reference graph

Works this paper leans on

12 extracted references · 10 canonical work pages · 5 internal anchors

[1] Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325.

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645, 2016a. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778...

[3] Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio. Difference target propagation. arXiv preprint arXiv:1412.7525.

[4] André Longon. Interpreting the residual stream of ResNet18. arXiv preprint arXiv:2407.05340.

[5] Gabriel Peyré. Muon dynamics as a spectral Wasserstein flow. arXiv preprint arXiv:2604.04891.

[6] Maria-Eleni Sfyraki and Jun-Kun Wang. Lions and muons: Optimization via stochastic Frank–Wolfe. arXiv preprint arXiv:2506.04192.

[7] Zakhar Shumaylov, Nathael Da Costa, Peter Zaika, Bálint Mucsányi, Alex Massucco, Yoav Gelberg, Carola-Bibiane Schönlieb, Yarin Gal, and Philipp Hennig. Muon is not that special: Random or inverted spectra work just as well. arXiv preprint arXiv:2605.11181.

[8] Bhavya Vasudeva, Puneesh Deora, Yize Zhao, Vatsal Sharan, and Christos Thrampoulidis. How Muon's spectral design benefits generalization: A study on imbalanced data. arXiv preprint arXiv:2510.22980.

[9] Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, and Vincent Y. F. Tan. Muon outperforms Adam in tail-end associative memory learning. arXiv preprint arXiv:2509.26030.

[10] Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Dirk Bergemann, and Zhuoran Yang. Why Muon outperforms Adam: A curvature perspective. arXiv preprint arXiv:2606.04662.

[11] A Detailed Existing Interpretations of Muon. For completeness, we provide a more detailed summary of the existing interpretations of Muon discussed in the main text. A.1 Steepest Descent under the Spectral Norm. Bernstein & Newhouse (2024) interpret Muon as steepest descent under the matrix spectral norm. Let $G = \nabla f(W)$ and define $\operatorname{Orth}(G) = U V^\top$, $G = U \Sigma V^\top$. The steepest...

[12] The zoom highlights the initial local-optimization advantage of SGD on $W_1$ segments before Muon's representation advantage propagates through later $W_2$ updates. Figure 2 plots this effect directly as the loss gap $L_{\mathrm{SGD}} - L_{\mathrm{Muon}}$, with Figure 3 zooming in on the first 2000 steps. The early negative gaps show SGD's local advantage on fixed-$W_2$ blue segments...
