Recognition: 2 theorem links
OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
Pith reviewed 2026-05-11 02:42 UTC · model grok-4.3
The pith
OrScale extends Muon with a layer-wise trust ratio whose denominator is the Frobenius norm of the actual parameter-space update direction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OrScale is a trust-ratio extension of Muon built on the rule that the denominator of a layer-wise ratio should measure the Frobenius norm of the actual parameter-space direction that will be applied. This yields OrScale for general matrix layers and OrScale-LM for language models, where Moonlight shape scaling is combined with one-time per-layer calibration so every trust ratio starts at one. The real-update-direction denominator with coupled weight decay avoids the three failure modes of shape-degenerate denominators, raw-momentum clip saturation, and decoupled weight-decay runaway.
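To make the rule concrete, the following is a minimal single-layer sketch of this reading, assuming a LAMB-style numerator (the layer's own parameter norm) and an SVD stand-in for Muon's Newton-Schulz orthogonalization; the function and variable names are illustrative, not the authors' implementation.

```python
# Hedged sketch of one OrScale-style layer update as described in the core claim.
# Assumptions: LAMB-style numerator, SVD orthogonalization in place of Newton-Schulz,
# and the hyperparameter defaults below; none of these are taken from the paper.
import torch

def orscale_layer_step(weight, momentum_buf, grad, lr, beta=0.95, weight_decay=0.01):
    # Heavy-ball momentum on the raw gradient, as in Muon.
    momentum_buf.mul_(beta).add_(grad)

    # Orthogonalize the momentum (Muon uses a Newton-Schulz iteration; SVD is a stand-in).
    U, _, Vh = torch.linalg.svd(momentum_buf, full_matrices=False)
    direction = U @ Vh

    # Coupled weight decay: fold the decay term in *before* measuring the norm,
    # so the denominator reflects the actual parameter-space step.
    direction = direction + weight_decay * weight

    # Trust ratio: parameter norm over the Frobenius norm of the real update direction.
    trust = (weight.norm() / (direction.norm() + 1e-12)).item()

    weight.add_(direction, alpha=-lr * trust)
    return weight, momentum_buf
```

The point the core claim turns on is where the denominator norm is taken: after orthogonalization and weight decay have shaped the step, not on the raw momentum or on shape constants alone.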
What carries the argument
The real-update-direction Frobenius-norm denominator together with coupled weight decay
If this is right
- OrScale ranks first on CIFAR-10/DavidNet across three seeds, raising Muon validation top-1 from 93.70% to 94.05%.
- OrScale-LM improves FineWeb-Edu pre-training versus Muon+Moonlight at three of four scales from 125M to 1.1B parameters.
- OrScale-LM outperforms AdamW at every tested scale.
- OrScale admits an O(1/sqrt(T)) nonconvex convergence guarantee in a nuclear-norm criterion and a strict layer-adaptive descent gain under measurable layer heterogeneity.
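One way to read the nuclear-norm criterion, written only as a sketch: the excerpt names the rate but not the exact stationarity measure, so the norm placement and the constant C below are assumptions.

```latex
\min_{1 \le t \le T}\; \mathbb{E}\!\left[\, \sum_{\ell} \big\|\nabla_{W_\ell} f(W_t)\big\|_{*} \right]
\;\le\; \frac{C}{\sqrt{T}}
```

Here \|\cdot\|_{*} denotes the nuclear norm and C would collect smoothness, gradient-noise, and initialization constants.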
Where Pith is reading between the lines
- The same real-update-direction norm rule could be inserted into other orthogonalized or matrix-aware optimizers to obtain similar layer-wise adaptation.
- Calibration that forces initial trust ratios to one may simplify hyperparameter transfer when moving from small to large models in new domains.
- If the nuclear-norm convergence holds in practice it would suggest OrScale is especially well-suited to low-rank or factorized update schemes.
Load-bearing premise
The Frobenius norm of the actual update direction combined with coupled weight decay will continue to avoid shape-degenerate denominators, momentum clip saturation, and weight-decay runaway once applied to larger models or different architectures without further tuning.
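A sketch of the coupling this premise depends on, with assumed notation (M_t the momentum buffer, orth the orthogonalization, lambda the weight-decay coefficient): coupled decay enters the direction before its Frobenius norm is taken, so the trust ratio measures the full parameter-space step; decoupled decay bypasses the ratio entirely, which is the runaway mode the paper flags.

```latex
% Coupled weight decay (assumed OrScale form): the denominator sees the decay term.
D_t = \mathrm{orth}(M_t) + \lambda W_t, \qquad
W_{t+1} = W_t - \eta\, \frac{\|W_t\|_F}{\|D_t\|_F}\, D_t

% Decoupled weight decay (flagged failure mode): the decay term never enters the ratio.
W_{t+1} = W_t - \eta\, \frac{\|W_t\|_F}{\|\mathrm{orth}(M_t)\|_F}\, \mathrm{orth}(M_t) \;-\; \eta\,\lambda\, W_t
```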
What would settle it
A single training run on a model larger than 1.1B parameters or on an untested architecture where one of the three failure modes reappears and OrScale underperforms Muon.
Original abstract
Muon improves neural-network training by orthogonalizing matrix-valued updates, but it leaves each layer's update magnitude controlled mostly by a global learning rate. We introduce OrScale, a trust-ratio extension of Muon built on a simple rule: the denominator of a layer-wise ratio should measure the Frobenius norm of the actual parameter-space direction that will be applied. This yields OrScale for general matrix layers and OrScale-LM for language models, where Moonlight shape scaling is combined with one-time per-layer calibration so every trust ratio starts at one. We analyze why three natural Muon-LAMB hybrids fail through shape-degenerate denominators, raw-momentum clip saturation, and decoupled weight-decay runaway, and show that the real-update-direction denominator with coupled weight decay avoids these failures. Theoretically, OrScale admits an O(1/sqrt(T)) nonconvex convergence guarantee in a nuclear-norm criterion, a strict layer-adaptive descent gain under measurable layer heterogeneity, and calibration properties that preserve muP-style learning-rate transfer at initialization. Empirically, OrScale ranks first on CIFAR-10/DavidNet across three seeds, improving Muon from 93.70% to 94.05% validation top-1, and OrScale-LM improves FineWeb-Edu pre-training versus Muon+Moonlight at three of four scales from 125M to 1.1B parameters while outperforming AdamW at every scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OrScale, a trust-ratio extension of the Muon optimizer that uses the Frobenius norm of the actual (orthogonalized and scaled) update direction as the layer-wise denominator, paired with coupled weight decay. It analyzes three failure modes (shape-degenerate denominators, raw-momentum clip saturation, decoupled weight-decay runaway) that arise in Muon-LAMB hybrids, derives an O(1/sqrt(T)) nuclear-norm nonconvex convergence guarantee plus layer-adaptive descent under heterogeneity, and shows calibration properties that preserve muP-style LR transfer. Empirically, OrScale improves Muon on CIFAR-10/DavidNet (93.70% to 94.05% top-1 across three seeds) and OrScale-LM outperforms Muon+Moonlight and AdamW on FineWeb-Edu pre-training at scales from 125M to 1.1B parameters.
Significance. If the central claims hold, OrScale would provide a principled, failure-avoiding mechanism for combining orthogonalization with adaptive per-layer scaling, backed by explicit convergence theory and calibration analysis. The explicit dissection of hybrid failure modes and the nuclear-norm guarantee are clear strengths that could support further theoretical work on orthogonal optimizers.
major comments (2)
- [Experimental Results] Experimental section (results on FineWeb-Edu and CIFAR-10): the claim that the real-update Frobenius-norm denominator plus coupled WD avoids the three failure modes is load-bearing for the paper's motivation and generalization argument, yet no ablation on denominator choice, no monitoring of failure indicators (denominator condition number, clip frequency, effective WD magnitude), and no scaling curves beyond 1.1B parameters are reported; this leaves the avoidance claim verified only at the tested scales and architectures.
- [Theoretical Analysis] Theoretical analysis (convergence and calibration sections): the O(1/sqrt(T)) nuclear-norm bound and layer-adaptive descent are derived under the maintained assumption that the denominator rule prevents the enumerated failure modes, but the proof does not explicitly verify that the one-time per-layer calibration (which forces initial trust ratios to 1) preserves these properties over long horizons or under momentum clipping.
minor comments (2)
- [Methods] The pseudocode for OrScale-LM (methods section) would benefit from explicitly showing how Moonlight shape scaling is combined with the one-time calibration step.
- [Preliminaries] Notation for the trust ratio in the main equations could be clarified by adding a short remark distinguishing the real-update Frobenius norm from the parameter-norm denominator used in LAMB.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point by point below and indicate the revisions planned for the next manuscript version.
Point-by-point responses
-
Referee: [Experimental Results] Experimental section (results on FineWeb-Edu and CIFAR-10): the claim that the real-update Frobenius-norm denominator plus coupled WD avoids the three failure modes is load-bearing for the paper's motivation and generalization argument, yet no ablation on denominator choice, no monitoring of failure indicators (denominator condition number, clip frequency, effective WD magnitude), and no scaling curves beyond 1.1B parameters are reported; this leaves the avoidance claim verified only at the tested scales and architectures.
Authors: We agree that explicit ablations and monitoring would strengthen the empirical support for failure-mode avoidance. In the revised manuscript we will add (i) a CIFAR-10 ablation comparing the Frobenius-norm denominator against shape-based and spectral-norm alternatives and (ii) training curves that track denominator condition numbers, momentum-clip frequencies, and effective weight-decay magnitudes for OrScale versus the three failing hybrids. Scaling curves beyond 1.1B parameters cannot be provided due to compute limits; we will state this limitation explicitly and discuss expected scaling behavior from the theory. revision: partial
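For reference, a rough per-layer sketch of the promised failure-mode monitoring; the indicator definitions below (condition number of the realized direction, a clip-activity flag, and the decay-to-update ratio) are assumptions, not the paper's instrumentation.

```python
# Hedged sketch of per-layer failure-mode diagnostics; indicator definitions are assumed.
import torch

def failure_mode_indicators(direction, momentum_buf, weight, weight_decay, clip_threshold=None):
    stats = {}
    # Shape-degenerate denominator: a nearly rank-deficient direction makes ||D||_F uninformative.
    svals = torch.linalg.svdvals(direction)
    stats["denominator_cond"] = (svals.max() / svals.min().clamp_min(1e-12)).item()
    # Raw-momentum clip saturation: does the momentum norm sit at the clip threshold?
    if clip_threshold is not None:
        stats["clip_active"] = bool(momentum_buf.norm() >= clip_threshold)
    # Weight-decay runaway: effective decay magnitude relative to the realized update.
    stats["effective_wd_ratio"] = (weight_decay * weight.norm() / (direction.norm() + 1e-12)).item()
    return stats
```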
-
Referee: [Theoretical Analysis] Theoretical analysis (convergence and calibration sections): the O(1/sqrt(T)) nuclear-norm bound and layer-adaptive descent are derived under the maintained assumption that the denominator rule prevents the enumerated failure modes, but the proof does not explicitly verify that the one-time per-layer calibration (which forces initial trust ratios to 1) preserves these properties over long horizons or under momentum clipping.
Authors: The convergence statements assume the OrScale rule that the denominator is always the Frobenius norm of the realized (orthogonalized, momentum-applied) update; this rule by construction precludes shape degeneracy and the other enumerated pathologies. The one-time calibration only resets initial trust ratios to 1; thereafter the denominator is recomputed from the current direction, so the assumptions remain satisfied. We acknowledge that the proof does not contain an explicit long-horizon preservation argument under clipping. The revision will add a clarifying paragraph in the theoretical section that justifies preservation from the rule definition and cross-references the new empirical clip-frequency plots. revision: yes
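To illustrate the point being argued, a minimal sketch of how one-time calibration could sit on top of Moonlight-style shape scaling so that the first trust ratio is exactly one while later ratios are recomputed from the current direction; the scale formula, constants, and names are assumptions, not the paper's pseudocode.

```python
# Hedged sketch: Moonlight-style shape scaling plus one-time per-layer calibration.
# The sqrt(max(fan_out, fan_in)) scale and the calibration rule are assumptions.
import math
import torch

def moonlight_shape_scale(fan_out, fan_in):
    return math.sqrt(max(fan_out, fan_in))  # assumed form of the shape factor

def calibrate_layer(weight, initial_direction, shape_scale):
    # One-time factor chosen so the very first trust ratio equals exactly 1.
    return (shape_scale * initial_direction).norm().item() / (weight.norm().item() + 1e-12)

def orscale_lm_trust_ratio(weight, direction, shape_scale, calib):
    # After calibration, the denominator is recomputed from the current direction each step.
    scaled = shape_scale * direction
    return calib * weight.norm().item() / (scaled.norm().item() + 1e-12)
```

At step zero the ratio equals one by construction; afterwards the denominator tracks the realized direction, which is the behavior the rebuttal appeals to.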
- Not addressed in this revision: scaling experiments to model sizes larger than 1.1B parameters, which exceed our current computational resources.
Circularity Check
One-time per-layer calibration forces initial trust ratios to 1 by construction
specific steps
-
fitted input presented as a prediction
[Abstract (OrScale-LM description)]
"Moonlight shape scaling is combined with one-time per-layer calibration so every trust ratio starts at one."
The per-layer calibration is chosen specifically to force the initial trust ratio to equal one. This makes the starting value of the very quantity (trust ratio) that the scaling method is designed to regulate a direct consequence of the calibration fit, rather than an independent outcome of the update rule.
full rationale
The paper's central denominator rule (Frobenius norm of the actual update direction) and coupled weight decay, along with the O(1/sqrt(T)) nuclear-norm convergence and layer-adaptive descent claims, are derived independently without reducing to fitted inputs. However, the OrScale-LM variant explicitly introduces a calibration step whose sole purpose is to set every initial trust ratio to one. This normalization is defined directly in terms of the trust-ratio quantity the method seeks to control, creating a moderate self-referential element in the scaling setup. No self-citations, uniqueness theorems, or ansatz smuggling appear load-bearing in the provided text. The empirical results and failure-mode analysis remain non-circular.
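In symbols (notation assumed: W_l^(0) the initial weights, D_l^(0) the initial shape-scaled update direction, c_l the calibration factor), the observation is simply that the calibration constant is built from the same quantities that form the ratio, so the starting value is fixed by construction:

```latex
c_\ell \;=\; \frac{\|D_\ell^{(0)}\|_F}{\|W_\ell^{(0)}\|_F}
\qquad\Longrightarrow\qquad
r_\ell^{(0)} \;=\; c_\ell\, \frac{\|W_\ell^{(0)}\|_F}{\|D_\ell^{(0)}\|_F} \;=\; 1
```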
Axiom & Free-Parameter Ledger
free parameters (1)
- one-time per-layer calibration factor
axioms (1)
- standard math: O(1/sqrt(T)) convergence holds in the nuclear-norm sense in the nonconvex setting
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
the denominator of a layer-wise trust ratio should measure the Frobenius norm of the real parameter-space direction... nuclear-norm O(1/√T) nonconvex convergence guarantee
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Theorem 1 (Basic OrScale convergence)... Theorem 3 (OrScale-specific lower bound on κ_eff)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning, 2024. URL https://arxiv.org/abs/2410.21265.
- [2-3] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf.
- [4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. URL https://arxiv.org/abs/1810.04805.
- [5] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour, 2018. URL https://arxiv.org/abs/1706.02677.
- [6] Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the NanoGPT baseline, 2024. URL https://github.com/KellerJordan/modded-nanogpt.
- [7] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/.
- [8] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980.
- [9] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks, 2014. URL https://arxiv.org/abs/1404.5997.
- [10] Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training, 2024. URL https://arxiv.org/abs/2305.14342.
- [11] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is scalable for LLM training, 2025.
- [12] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101.
- [13] Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, and Thomas Wolf. The FineWeb datasets: Decanting the web for the finest text data at scale, 2024. URL https://arxiv.org/abs/2406.17557.
- [14] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI, 2019. URL https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. Accessed: 2024-11-15.
- [15] Chongjie Si, Debing Zhang, and Wei Shen. AdaMuon: Adaptive Muon optimizer. arXiv preprint arXiv:2507.11005, 2025. URL https://arxiv.org/abs/2507.11005.
- [16] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1139–1147, Atlanta, Georgia, USA, 17–19 Jun 2013.
- [17] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Che... arXiv, 2026.
- [18] Kimi Team, Yifan Bai, Yiping Bao, Y. Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Chenxiao Gao, Hongcheng Gao, Peizhong G... arXiv, 2026.
- [19] Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. SOAP: Improving and stabilizing Shampoo using Adam, 2025. URL https://arxiv.org/abs/2409.11321.
- [20] Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor Programs V: Tuning large neural networks via zero-shot hyperparameter transfer, 2022. URL https://arxiv.org/abs/2203.03466.
- [21] Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor Programs VI: Feature learning in infinite-depth neural networks, 2023. URL https://arxiv.org/abs/2310.02244.
- [22-23] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. URL https://arxiv.org/abs/1708.03888.
- [24] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes, 2020. URL https://arxiv.org/abs/1904.00962.