Recognition: 2 theorem links
OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
Pith reviewed 2026-05-11 02:42 UTC · model grok-4.3
The pith
OrScale extends Muon with a layer-wise trust ratio whose denominator is the Frobenius norm of the actual parameter-space update direction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OrScale is a trust-ratio extension of Muon built on the rule that the denominator of a layer-wise ratio should measure the Frobenius norm of the actual parameter-space direction that will be applied. This yields OrScale for general matrix layers and OrScale-LM for language models, where Moonlight shape scaling is combined with one-time per-layer calibration so every trust ratio starts at one. The real-update-direction denominator with coupled weight decay avoids the three failure modes of shape-degenerate denominators, raw-momentum clip saturation, and decoupled weight-decay runaway.
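To make the rule concrete, the following is a minimal single-layer sketch of this reading, assuming a LAMB-style numerator (the layer's own parameter norm) and an SVD stand-in for Muon's Newton-Schulz orthogonalization; the function and variable names are illustrative, not the authors' implementation.

```python
# Hedged sketch of one OrScale-style layer update as described in the core claim.
# Assumptions: LAMB-style numerator, SVD orthogonalization in place of Newton-Schulz,
# and the hyperparameter defaults below; none of these are taken from the paper.
import torch

def orscale_layer_step(weight, momentum_buf, grad, lr, beta=0.95, weight_decay=0.01):
    # Heavy-ball momentum on the raw gradient, as in Muon.
    momentum_buf.mul_(beta).add_(grad)

    # Orthogonalize the momentum (Muon uses a Newton-Schulz iteration; SVD is a stand-in).
    U, _, Vh = torch.linalg.svd(momentum_buf, full_matrices=False)
    direction = U @ Vh

    # Coupled weight decay: fold the decay term in *before* measuring the norm,
    # so the denominator reflects the actual parameter-space step.
    direction = direction + weight_decay * weight

    # Trust ratio: parameter norm over the Frobenius norm of the real update direction.
    trust = (weight.norm() / (direction.norm() + 1e-12)).item()

    weight.add_(direction, alpha=-lr * trust)
    return weight, momentum_buf
```

The point the core claim turns on is where the denominator norm is taken: after orthogonalization and weight decay have shaped the step, not on the raw momentum or on shape constants alone.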
What carries the argument
The real-update-direction Frobenius-norm denominator together with coupled weight decay
If this is right
- OrScale ranks first on CIFAR-10/DavidNet across three seeds, raising Muon validation top-1 from 93.70% to 94.05%.
- OrScale-LM improves FineWeb-Edu pre-training versus Muon+Moonlight at three of four scales from 125M to 1.1B parameters.
- OrScale-LM outperforms AdamW at every tested scale.
- OrScale admits an O(1/sqrt(T)) nonconvex convergence guarantee in a nuclear-norm criterion and a strict layer-adaptive descent gain under measurable layer heterogeneity.
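One way to read the nuclear-norm criterion, written only as a sketch: the excerpt names the rate but not the exact stationarity measure, so the norm placement and the constant C below are assumptions.

```latex
\min_{1 \le t \le T}\; \mathbb{E}\!\left[\, \sum_{\ell} \big\|\nabla_{W_\ell} f(W_t)\big\|_{*} \right]
\;\le\; \frac{C}{\sqrt{T}}
```

Here \|\cdot\|_{*} denotes the nuclear norm and C would collect smoothness, gradient-noise, and initialization constants.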
Where Pith is reading between the lines
- The same real-update-direction norm rule could be inserted into other orthogonalized or matrix-aware optimizers to obtain similar layer-wise adaptation.
- Calibration that forces initial trust ratios to one may simplify hyperparameter transfer when moving from small to large models in new domains.
- If the nuclear-norm convergence holds in practice it would suggest OrScale is especially well-suited to low-rank or factorized update schemes.
Load-bearing premise
The Frobenius norm of the actual update direction combined with coupled weight decay will continue to avoid shape-degenerate denominators, momentum clip saturation, and weight-decay runaway once applied to larger models or different architectures without further tuning.
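A sketch of the coupling this premise depends on, with assumed notation (M_t the momentum buffer, orth the orthogonalization, lambda the weight-decay coefficient): coupled decay enters the direction before its Frobenius norm is taken, so the trust ratio measures the full parameter-space step; decoupled decay bypasses the ratio entirely, which is the runaway mode the paper flags.

```latex
% Coupled weight decay (assumed OrScale form): the denominator sees the decay term.
D_t = \mathrm{orth}(M_t) + \lambda W_t, \qquad
W_{t+1} = W_t - \eta\, \frac{\|W_t\|_F}{\|D_t\|_F}\, D_t

% Decoupled weight decay (flagged failure mode): the decay term never enters the ratio.
W_{t+1} = W_t - \eta\, \frac{\|W_t\|_F}{\|\mathrm{orth}(M_t)\|_F}\, \mathrm{orth}(M_t) \;-\; \eta\,\lambda\, W_t
```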
What would settle it
A single training run on a model larger than 1.1B parameters or on an untested architecture where one of the three failure modes reappears and OrScale underperforms Muon.
Original abstract
Muon improves neural-network training by orthogonalizing matrix-valued updates, but it leaves each layer's update magnitude controlled mostly by a global learning rate. We introduce OrScale, a trust-ratio extension of Muon built on a simple rule: the denominator of a layer-wise ratio should measure the Frobenius norm of the actual parameter-space direction that will be applied. This yields OrScale for general matrix layers and OrScale-LM for language models, where Moonlight shape scaling is combined with one-time per-layer calibration so every trust ratio starts at one. We analyze why three natural Muon-LAMB hybrids fail through shape-degenerate denominators, raw-momentum clip saturation, and decoupled weight-decay runaway, and show that the real-update-direction denominator with coupled weight decay avoids these failures. Theoretically, OrScale admits an O(1/sqrt(T)) nonconvex convergence guarantee in a nuclear-norm criterion, a strict layer-adaptive descent gain under measurable layer heterogeneity, and calibration properties that preserve muP-style learning-rate transfer at initialization. Empirically, OrScale ranks first on CIFAR-10/DavidNet across three seeds, improving Muon from 93.70% to 94.05% validation top-1, and OrScale-LM improves FineWeb-Edu pre-training versus Muon+Moonlight at three of four scales from 125M to 1.1B parameters while outperforming AdamW at every scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OrScale, a trust-ratio extension of the Muon optimizer that uses the Frobenius norm of the actual (orthogonalized and scaled) update direction as the layer-wise denominator, paired with coupled weight decay. It analyzes three failure modes (shape-degenerate denominators, raw-momentum clip saturation, decoupled weight-decay runaway) that arise in Muon-LAMB hybrids, derives an O(1/sqrt(T)) nuclear-norm nonconvex convergence guarantee plus layer-adaptive descent under heterogeneity, and shows calibration properties that preserve muP-style LR transfer. Empirically, OrScale improves Muon on CIFAR-10/DavidNet (93.70% to 94.05% top-1 across three seeds) and OrScale-LM outperforms Muon+Moonlight and AdamW on FineWeb-Edu pre-training at scales from 125M to 1.1B parameters.
Significance. If the central claims hold, OrScale would provide a principled, failure-avoiding mechanism for combining orthogonalization with adaptive per-layer scaling, backed by explicit convergence theory and calibration analysis. The explicit dissection of hybrid failure modes and the nuclear-norm guarantee are clear strengths that could support further theoretical work on orthogonal optimizers.
major comments (2)
- [Experimental Results] Experimental section (results on FineWeb-Edu and CIFAR-10): the claim that the real-update Frobenius-norm denominator plus coupled WD avoids the three failure modes is load-bearing for the paper's motivation and generalization argument, yet no ablation on denominator choice, no monitoring of failure indicators (denominator condition number, clip frequency, effective WD magnitude), and no scaling curves beyond 1.1B parameters are reported; this leaves the avoidance claim verified only at the tested scales and architectures.
- [Theoretical Analysis] Theoretical analysis (convergence and calibration sections): the O(1/sqrt(T)) nuclear-norm bound and layer-adaptive descent are derived under the maintained assumption that the denominator rule prevents the enumerated failure modes, but the proof does not explicitly verify that the one-time per-layer calibration (which forces initial trust ratios to 1) preserves these properties over long horizons or under momentum clipping.
minor comments (2)
- [Methods] The pseudocode for OrScale-LM (methods section) would benefit from explicitly showing how Moonlight shape scaling is combined with the one-time calibration step.
- [Preliminaries] Notation for the trust ratio in the main equations could be clarified by adding a short remark distinguishing the real-update Frobenius norm from the parameter-norm denominator used in LAMB.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point by point below and indicate the revisions planned for the next manuscript version.
Point-by-point responses
-
Referee: [Experimental Results] Experimental section (results on FineWeb-Edu and CIFAR-10): the claim that the real-update Frobenius-norm denominator plus coupled WD avoids the three failure modes is load-bearing for the paper's motivation and generalization argument, yet no ablation on denominator choice, no monitoring of failure indicators (denominator condition number, clip frequency, effective WD magnitude), and no scaling curves beyond 1.1B parameters are reported; this leaves the avoidance claim verified only at the tested scales and architectures.
Authors: We agree that explicit ablations and monitoring would strengthen the empirical support for failure-mode avoidance. In the revised manuscript we will add (i) a CIFAR-10 ablation comparing the Frobenius-norm denominator against shape-based and spectral-norm alternatives and (ii) training curves that track denominator condition numbers, momentum-clip frequencies, and effective weight-decay magnitudes for OrScale versus the three failing hybrids. Scaling curves beyond 1.1B parameters cannot be provided due to compute limits; we will state this limitation explicitly and discuss expected scaling behavior from the theory. revision: partial
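For reference, a rough per-layer sketch of the promised failure-mode monitoring; the indicator definitions below (condition number of the realized direction, a clip-activity flag, and the decay-to-update ratio) are assumptions, not the paper's instrumentation.

```python
# Hedged sketch of per-layer failure-mode diagnostics; indicator definitions are assumed.
import torch

def failure_mode_indicators(direction, momentum_buf, weight, weight_decay, clip_threshold=None):
    stats = {}
    # Shape-degenerate denominator: a nearly rank-deficient direction makes ||D||_F uninformative.
    svals = torch.linalg.svdvals(direction)
    stats["denominator_cond"] = (svals.max() / svals.min().clamp_min(1e-12)).item()
    # Raw-momentum clip saturation: does the momentum norm sit at the clip threshold?
    if clip_threshold is not None:
        stats["clip_active"] = bool(momentum_buf.norm() >= clip_threshold)
    # Weight-decay runaway: effective decay magnitude relative to the realized update.
    stats["effective_wd_ratio"] = (weight_decay * weight.norm() / (direction.norm() + 1e-12)).item()
    return stats
```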
-
Referee: [Theoretical Analysis] Theoretical analysis (convergence and calibration sections): the O(1/sqrt(T)) nuclear-norm bound and layer-adaptive descent are derived under the maintained assumption that the denominator rule prevents the enumerated failure modes, but the proof does not explicitly verify that the one-time per-layer calibration (which forces initial trust ratios to 1) preserves these properties over long horizons or under momentum clipping.
Authors: The convergence statements assume the OrScale rule that the denominator is always the Frobenius norm of the realized (orthogonalized, momentum-applied) update; this rule by construction precludes shape degeneracy and the other enumerated pathologies. The one-time calibration only resets initial trust ratios to 1; thereafter the denominator is recomputed from the current direction, so the assumptions remain satisfied. We acknowledge that the proof does not contain an explicit long-horizon preservation argument under clipping. The revision will add a clarifying paragraph in the theoretical section that justifies preservation from the rule definition and cross-references the new empirical clip-frequency plots. revision: yes
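To illustrate the point being argued, a minimal sketch of how one-time calibration could sit on top of Moonlight-style shape scaling so that the first trust ratio is exactly one while later ratios are recomputed from the current direction; the scale formula, constants, and names are assumptions, not the paper's pseudocode.

```python
# Hedged sketch: Moonlight-style shape scaling plus one-time per-layer calibration.
# The sqrt(max(fan_out, fan_in)) scale and the calibration rule are assumptions.
import math
import torch

def moonlight_shape_scale(fan_out, fan_in):
    return math.sqrt(max(fan_out, fan_in))  # assumed form of the shape factor

def calibrate_layer(weight, initial_direction, shape_scale):
    # One-time factor chosen so the very first trust ratio equals exactly 1.
    return (shape_scale * initial_direction).norm().item() / (weight.norm().item() + 1e-12)

def orscale_lm_trust_ratio(weight, direction, shape_scale, calib):
    # After calibration, the denominator is recomputed from the current direction each step.
    scaled = shape_scale * direction
    return calib * weight.norm().item() / (scaled.norm().item() + 1e-12)
```

At step zero the ratio equals one by construction; afterwards the denominator tracks the realized direction, which is the behavior the rebuttal appeals to.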
- Not addressed in this revision: scaling experiments to model sizes larger than 1.1B parameters, which exceed our current computational resources.
Circularity Check
One-time per-layer calibration forces initial trust ratios to 1 by construction
specific steps
-
fitted input presented as a prediction
[Abstract (OrScale-LM description)]
"Moonlight shape scaling is combined with one-time per-layer calibration so every trust ratio starts at one."
The per-layer calibration is chosen specifically to force the initial trust ratio to equal one. This makes the starting value of the very quantity (trust ratio) that the scaling method is designed to regulate a direct consequence of the calibration fit, rather than an independent outcome of the update rule.
full rationale
The paper's central denominator rule (Frobenius norm of the actual update direction) and coupled weight decay, along with the O(1/sqrt(T)) nuclear-norm convergence and layer-adaptive descent claims, are derived independently without reducing to fitted inputs. However, the OrScale-LM variant explicitly introduces a calibration step whose sole purpose is to set every initial trust ratio to one. This normalization is defined directly in terms of the trust-ratio quantity the method seeks to control, creating a moderate self-referential element in the scaling setup. No self-citations, uniqueness theorems, or ansatz smuggling appear load-bearing in the provided text. The empirical results and failure-mode analysis remain non-circular.
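In symbols (notation assumed: W_l^(0) the initial weights, D_l^(0) the initial shape-scaled update direction, c_l the calibration factor), the observation is simply that the calibration constant is built from the same quantities that form the ratio, so the starting value is fixed by construction:

```latex
c_\ell \;=\; \frac{\|D_\ell^{(0)}\|_F}{\|W_\ell^{(0)}\|_F}
\qquad\Longrightarrow\qquad
r_\ell^{(0)} \;=\; c_\ell\, \frac{\|W_\ell^{(0)}\|_F}{\|D_\ell^{(0)}\|_F} \;=\; 1
```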
Axiom & Free-Parameter Ledger
free parameters (1)
- one-time per-layer calibration factor
axioms (1)
- standard math: O(1/sqrt(T)) convergence holds in the nuclear-norm sense in the nonconvex setting
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
the denominator of a layer-wise trust ratio should measure the Frobenius norm of the real parameter-space direction... nuclear-norm O(1/√T) nonconvex convergence guarantee
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Theorem 1 (Basic OrScale convergence)... Theorem 3 (OrScale-specific lower bound on κ_eff)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning, 2024. URL https://arxiv.org/abs/2410.21265.
- [2-3] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf.
- [4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. URL https://arxiv.org/abs/1810.04805.
- [5] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour, 2018. URL https://arxiv.org/abs/1706.02677.
- [6] Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the NanoGPT baseline, 2024. URL https://github.com/KellerJordan/modded-nanogpt.
- [7] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/.
- [8] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980.
- [9] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks, 2014. URL https://arxiv.org/abs/1404.5997.
- [10] Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training, 2024. URL https://arxiv.org/abs/2305.14342.
- [11] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is scalable for LLM training, 2025.
- [12] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101.
- [13] Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, and Thomas Wolf. The FineWeb datasets: Decanting the web for the finest text data at scale, 2024. URL https://arxiv.org/abs/2406.17557.
- [14] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI, 2019. URL https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. Accessed: 2024-11-15.
- [15] Chongjie Si, Debing Zhang, and Wei Shen. AdaMuon: Adaptive Muon optimizer. arXiv preprint arXiv:2507.11005, 2025. URL https://arxiv.org/abs/2507.11005.
- [16] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1139–1147, Atlanta, Georgia, USA, 17–19 Jun 2013.
- [17] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Che... arXiv, 2026.
- [18] Kimi Team, Yifan Bai, Yiping Bao, Y. Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Chenxiao Gao, Hongcheng Gao, Peizhong G... arXiv, 2026.
- [19] Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. SOAP: Improving and stabilizing Shampoo using Adam, 2025. URL https://arxiv.org/abs/2409.11321.
- [20] Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor Programs V: Tuning large neural networks via zero-shot hyperparameter transfer, 2022. URL https://arxiv.org/abs/2203.03466.
- [21] Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor Programs VI: Feature learning in infinite-depth neural networks, 2023. URL https://arxiv.org/abs/2310.02244.
- [22-23] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. URL https://arxiv.org/abs/1708.03888.
- [24] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes, 2020. URL https://arxiv.org/abs/1904.00962.